Performance Evaluation of Set-Membership Algorithms for Signals in C" .” All work cited below acknowledges
joint  sponsorship. 

The  general  purpose  of  this  research  is  the  development  and  exploration  of  new  set-membership-based 
algorithms  for  adaptive  identification  of  parametric  signal  and  system  models.  We  are  pleased  to  report 
progress  in  several  important  aspects,  both  theoretical  and  applied,  of  this  general  scope.  The  report  consists 
of several preprints of papers in review by respected journals, published and preprinted conference papers, and
some  other  supporting  material.  A  clear  understanding  of  our  progress  is  inherent  in  the  discussion  of  each 
item  in  the  following.  These  discussions  are  meant  to  illuminate  the  directions,  rationale,  and  achievements 
of  our  research,  with  the  technical  details  left  to  the  papers.  The  items  appearing  in  the  following  are  grouped 
into  papers  written  for  journals,  followed  by  conference  papers,  descriptions  of  dissertations  in  preparation, 
then  documents  showing  further  evidence  of  research  progress.  Within  each  group,  the  items  appear  in 
chronological  order.  The  contents  are  as  follows: 

1. JOURNAL PAPER PREPRINTS, DRAFT MANUSCRIPTS, AND SUMMARIES

(a)  J.R.  Deller  and  S.F.  Odeh,  “Adaptive  set-membership  identification  in 
O(m) time for linear in parameters models,” IEEE Transactions on Signal
Processing  (revision  submitted  10/91).  [Preprint] 

This paper is a revision of an earlier submission which was based principally upon the Ph.D.
dissertation of Souheil F. Odeh. The revision includes many new results obtained under ONR
sponsorship. Reported are four significant contributions to the field:

• A generalization of all fundamental results in Optimal Bounding Ellipsoid (OBE) processing
to the case of complex signal MIMO models. Such models occur in many important problems
including, for example, adaptive beamforming and neural network learning.

• A class of explicitly adaptive OBE algorithms appears in a journal paper for the first time.

• A suboptimal test for innovation is developed which leads to a class of OBE algorithms which
empirically perform as well as those employing optimal checking. This check admits O(m)
computational complexity, which is the square root of the O(m^2) cost of optimal
methods and of RLS.

• Compact parallel architectures are developed which can be used for running both optimal
and suboptimal algorithms at O(m) expense.

(b) J.R. Deller, M. Nayeri, and S.F. Odeh, “System identification using
set-membership-based signal processing,” Proceedings of the IEEE (submitted
12/91 by invitation in response to paper proposal). [Letter of invitation
included with preprint]

In response to a proposal to the Proceedings, submission of this paper was encouraged by the
editors as the prefatory letters indicate. The paper is expository, reviewing the general field of
set-membership based identification algorithms, then focusing on the OBE algorithms which are
currently of most interest to the signal processing community. Reviewed are adaptive and
“nonadaptive” algorithms, efficient algorithms with suboptimal data checking, and parallel architectures
for implementation. In addition to the tutorial value of this paper, current research sponsored
in part by the ONR grant has led to a unified framework into which all OBE algorithms may
be placed. The paper discusses the field from this point of view, and, in an appendix, provides
unified and rigorous theoretical developments which underlie all major developments in the OBE
field. These developments are scattered throughout the literature, and in some cases are absent,
incomplete, or misunderstood. Both the novice reader and the expert should benefit from this
work.

(c)  S.D.  Hunt  and  J.R.  Deller,  “  ‘Linearized’  alternatives  to  back-propagation 
based on recursive QR decomposition,” IEEE Transactions on Neural
Networks (submitted 8/91). [Preprint]

The class of learning methods presented in this paper was developed en route to the application of
set-membership principles to neural network training. The algorithm is based upon linearization of
the dynamics of a feedforward neural network based on error surface analysis, followed by training
using a QR decomposition version of the RLS algorithm. The algorithm can be used to train
networks “node-wise” (all weights connected to a node updated simultaneously) or “layer-wise,”
and, in some cases, all weights of the network can be updated simultaneously. The node-wise case
turns out to be theoretically similar to a method developed by Azimi-Sadjadi et al. (A-S), but
the QR implementation renders the present algorithm vastly superior in terms of how often and
how quickly training converges. Both the reported method and the A-S method outperform conventional
back-propagation.

(d)  J.R.  Deller,  M.  Nayeri,  and  M.S.  Liu,  “Connections  between  the  Fogel- 
Huang  and  Dasgupta-Huang  optimal  bounding  ellipsoid  algorithms”  (in 
preparation, tentatively for Automatica). [Draft manuscript included]

The  Fogel-Huang  OBE  (F-H  OBE)  algorithm  is  attractive  in  its  clear  interpretability,  but  in  spite 
of  statements  to  the  contrary  in  the  literature,  it  does  not  have  proven  convergence  properties. 
On  the  other  hand,  the  Dasgupta-Huang  OBE  (D-H  OBE)  is  desirable  in  its  proven  convergence, 
but its controversial optimization criterion is not amenable to clear interpretation of the method's
operation. In our work related to the Proceedings paper above, intriguing connections between
D-H OBE and F-H OBE (in fact, between D-H and a broadly generalized version of F-H) were
discovered. These connections are apparently unknown to the research community, and are reported
in  this  paper.  It  is  suggested  that  these  findings  could  ultimately  lead  to  an  OBE  algorithm  with 
the  desirable  properties  of  both  methods. 

(e) M. Nayeri, J.R. Deller, and M.S. Liu, “A converging optimal bounding
ellipsoid algorithm with volume minimization (tentative title)” (in
preparation, tentatively for Automatica (special issue on signal processing)).
[Summary paper included]

We  found  the  OBE  algorithm  alluded  to  in  the  discussion  of  the  last  paper.  It  is  quite  possible 
that  this  will  be  a  landmark  paper  which  will  have  the  same  impact  on  the  field  as  the  original 
F-H  OBE  and  subsequent  D-H  OBE. 

(f) J.R. Deller and M. Nayeri, “Unifying the landmark developments in
optimal  bounding  ellipsoid  processing,”  International  Journal  of  Adaptive 
Control  and  Signal  Processing  (in  planning  in  response  to  recent  invitation). 
[Letter  of  invitation  &  planning  paper  included] 

The  Guest  Editor  of  this  special  issue  has  written  that  papers  with  tutorial  content  are  especially 
welcome.  This  paper  will  tie  together  in  one  source  several  of  the  key  unifying  themes  mentioned 
in the descriptions above. Accordingly, it will describe the general unifying themes, and lead the
reader to sources of information on rigorous theoretical details. In particular, we will develop
the “generic” Unified Optimal Bounding Ellipsoid (UOBE) algorithm, and show how all reported
algorithms, both adaptive and nonadaptive, are instances of UOBE. The interesting connections
between F-H OBE and D-H OBE described in paper 1d above can be presented in this framework.
Finally,  the  algorithm  which  combines  the  desirable  features  of  these  two  “landmark”  algorithms 
will  be  described.  This  paper  will  reach  a  large  population  of  researchers  in  Europe  whose  work  is 
system  and  control-oriented  and  who  might  not  be  as  familiar  with  the  signal  processing  literature. 

(g) M. Nayeri, J.R. Deller, and M.M. Krunz, “Convergence and colored
noise issues in bounding ellipsoid identification,” Proceedings of ICASSP
'92, San Francisco, March 1992 (to appear). [Preprint]

This paper presents a discussion of almost sure convergence of the UOBE
estimator (ellipsoid center) under ordinary “white noise” conditions on the model disturbances,
then presents the following new results concerning the ellipsoid behavior under various noise
conditions:

•  With  white  noise  disturbances,  UOBE  algorithms  involve  ellipsoidal  bounding  sets  which 
converge  in  some  unspecified  way  to  some  unspecified  “size.”  This  result  represents  the  first 
report in the literature of a convergence result for a “non-D-H” algorithm. The original F-H
OBE  paper  has  been  misinterpreted  to  mean  that  the  ellipsoid  converges  to  a  point. 

•  With  colored  noise  inputs,  the  limiting  ellipsoid  must  be  a  nontrivial  set.  Empirical  evidence 
suggests  that  the  true  parameters  lie  on  the  boundary  of  this  limiting  set. 

•  Arguments  are  made  in  support  of  the  idea  that  the  ellipsoid  may  collapse  into  a  subspace 
of  the  parameter  space  (thereby  diminishing  the  volume  of  the  ellipsoid  to  zero  without  its 
being  reduced  to  a  point)  if  and  only  if  the  input  is  not  persistently  exciting. 

(h)  J.R.  Deller  and  S.F.  Odeh,  “SM-WRLS  algorithms  with  an  efficient  test 
for innovation,” Proceedings of the 9th IFAC/IFORS Symposium on Identification
and System Parameter Identification, vol. 2, pp. 1044-1049, July 1991
(written  and  presented  by  invitation).  [Reprint] 

This paper presents some of the ideas concerning suboptimal testing cited in the description of
paper 1a.

(i)  J.R.  Deller  and  S.D.  Hunt,  “A  simple  ‘linearized’  learning  algorithm 
which  outperforms  back-propagation”  (submitted  to  International  Joint 
Conference  on  Neural  Networks,  1/92).  [Preprint] 

This paper presents some of the key developments of the algorithm cited in the description of
paper 1c.

Abstract

This  paper  describes  some  fundamental  contributions  to  the  theory  and  applicability  of  optimal 
bounding ellipsoid (OBE) algorithms for signal processing. All reported OBE algorithms are placed
in  a  general  framework  which  fruitfully  demonstrates  the  relationship  between  the  set-membership 
principles  and  least  square  error  identification.  Within  this  framework,  flexible  measures  for  adding 
explicit  adaptation  capability  are  formulated  and  demonstrated  through  simulation.  Computational 
complexity analysis of OBE algorithms reveals that they are of $O(m^2)$ complexity per data sample
with m the number of parameters identified, in spite of their well-known propensity toward highly
selective updating. Two very different approaches are described for rendering a specific OBE
algorithm, the set-membership weighted recursive least squares algorithm, of $O(m)$ complexity.
The  first  approach  involves  an  algorithmic  solution  in  which  a  suboptimal  test  for  innovation 
is  employed.  The  performance  is  demonstrated  through  simulation.  The  second  method  is  an 
architectural  approach  in  which  complexity  is  reduced  through  parallel  computation. 

1 Introduction


Set-membership (SM) identification of parametric systems is concerned with the computational
description of feasible sets of solutions which are consistent with the measurements and the modelling
assumptions. SM algorithms have been the subject of intense research effort in recent years
and many approaches have been explored. The papers in [1] and [2] provide a broad and current
overview  of  the  area.  In  particular,  comprehensive  reviews  of  the  field  with  extensive  reference  lists 
are  found  in  papers  by  Walter  and  Piet-Lahanier  [3]  and  by  Milanese  and  Vicino  [4].  An  extensive 
list  of  application  examples  with  references  is  also  given  in  the  Milanese  paper.  A  tutorial  on  the 
principal  algorithm  of  interest  in  this  paper,  the  so-called  set-membership  weighted  recursive  least 
squares  (SM-WRLS)  algorithm,  is  found  in  [5]. 

One class of SM methods, the optimal bounding ellipsoid (OBE) algorithms¹, is of particular
interest  to  the  signal  community  since  it  represents  a  merging  of  the  SM  approach  and  widely  used 
least  square  error  (LSE)  procedures  for  identifying  linear  models.  The  benefits  of  combining  SM 
considerations  (when  they  are  known)  with  LSE  processing  are  twofold:  First,  the  SM  information 
provides  a  feasible  set  of  solutions  which  complements  the  unique  LSE  estimate.  This  feasible  set 
can  help  to  compensate  for  the  restrictive  nature  of  the  assumptions  placed  upon  the  LSE  model. 
Secondly,  as  we  demonstrate  in  this  paper,  SM  knowledge  can  greatly  improve  the  efficiency  of  LSE 
identification. 

Two  aspects  of  OBE  processing  are  treated  in  this  paper.  In  a  general  way,  it  is  shown  that 
all  reported  OBE  algorithms  can  be  placed  into  a  unified  framework  which  is  clearly  related  to 
conventional  LSE  processing.  This  framework  will  embrace  explicitly  adaptive  OBE  algorithms 
which  will  be  demonstrated  as  a  first  major  contribution  of  the  paper.  The  second,  and  more 
extensive,  aspect  of  this  paper  is  concerned  with  the  computational  efficiency  of  OBE  algorithms. 
OBE  algorithms  (both  nonadaptive  and  adaptive)  entail  an  interesting  data  selection  procedure 
which typically discards 70-95% of the incoming data. The basis for this selective updating
is  a  determination  of  whether  the  incoming  datum  is  “informative”  in  the  sense  of  refining  the 
feasibility  set.  The  selective  updating  procedure,  however,  does  not  imply  a  similar  reduction 
in  computational  load,  since  the  effort  of  checking  for  innovation  in  the  data  is  approximately 
as expensive as the updating itself. In either case, the processing requires $O(m^2)$ floating point
operations² per incoming datum, where m represents the number of parameters to be estimated.

¹The original algorithm in this class due to Fogel and Huang [6] was called simply “OBE”. We use this term to
indicate the broader class of similar algorithms. The SM-WRLS algorithm will be seen below to be a specific type of
OBE algorithm in this broader sense.

²One flop is taken to be a multiplication plus an addition operation.
This point has not been clearly brought out in the literature. A second focus of this paper is
to demonstrate two very different methods for making a specific OBE algorithm run in $O(m)$
time. The first solution is algorithmic, while the second is architectural. The ability to execute
this interesting method in $O(m)$ time makes it highly competitive with conventional identification
techniques (especially recursive least squares (RLS)) which typically require $O(m^2)$ flops per point.

2  An  Adaptive  SM-WRLS  Algorithm 

2.1  The  Model  and  the  LSE  Identification  Problem 

The basic identification problem is as follows: We observe a system which is generating output
sequence $y(\cdot)$ in response to input sequence $u(\cdot)$. Both input and output sequences are measurable,
and $u(\cdot)$ is assumed to be a realization of a stationary, ergodic random process. The system is
governed by a “true” model of form

$$y(n) = \theta_*^T x(n) + \varepsilon_*(n) \qquad (1)$$

in which $x(n)$ is some m-vector of functions of p lags of $y(\cdot)$ at time n, and q lags plus the present
value of $u(\cdot)$, and where $\varepsilon_*(\cdot)$ is the realization of a zero-mean, white noise error sequence. The
error sequence is not measurable and the “true” parameters $\theta_* \in \mathcal{R}^m$ are unknown. At time n we
wish to use the observed data on $t \in [1, n]$ to deduce an estimated model which is similar in form
to (1),

$$y(n) = \theta^T(n) x(n) + \varepsilon(n, \theta(n)). \qquad (2)$$

In  the  following,  the  identified  parameter  vector  will  be  unique  for  each  n  (e.g.  [7]),  but  will  change 
at  every  step.  Hence,  we  index  the  parameter  estimate  by  n.  The  error  sequence  will  depend  on 
the  choice  of  parameters,  and  we  explicitly  show  this  dependence.  Neglecting  the  error  term,  this 
model  exhibits  only  linear  functional  dependence  upon  the  parameter  vector  and  has  been  called 
a  linear  in  unknown  coefficients  (e.g.  [8])  or  linear-in-parameters  (LP)  model  (e.g.  [3]).  Special 
cases of the LP model of (2) are the autoregressive-exogenous input (ARX) and autoregressive (AR)
models (e.g. [9]-[11]). For a current overview of methods that deal with nonlinear models, the
reader is referred to [3],[4].
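
As a concrete illustration of the LP form, the following minimal sketch builds the regressor $x(n)$ for an ARX model. It assumes that $x(n)$ simply stacks p lags of the output followed by the present value and q lags of the input, so that m = p + q + 1. The function names `arx_regressor` and `predict`, the ordering of the entries, the 0-based time index, and the zero padding before the first sample are all assumptions made for this illustration and do not come from the text.

```python
def arx_regressor(y, u, n, p, q):
    """Return x(n) = [y(n-1), ..., y(n-p), u(n), u(n-1), ..., u(n-q)].

    y and u are 0-indexed lists; samples before index 0 are taken as zero
    (an assumption made only for this illustration).
    """
    def sample(seq, k):
        return seq[k] if 0 <= k < len(seq) else 0.0

    past_outputs = [sample(y, n - i) for i in range(1, p + 1)]
    inputs = [sample(u, n - j) for j in range(0, q + 1)]
    return past_outputs + inputs


def predict(theta, x):
    """Noise-free model output theta^T x(n), linear in the parameters."""
    return sum(t * xi for t, xi in zip(theta, x))
```

Setting q = 0 and dropping the input term gives the AR special case; the point of the sketch is only that the model output is linear in the parameter vector once $x(n)$ has been formed.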

In particular, we desire the weighted LSE model for which $\theta(n)$ minimizes $\xi(n) = \frac{1}{n} \sum_{t=1}^{n} \lambda_n(t)\, \varepsilon^2(t, \theta(n))$,
where $\lambda_n(\cdot)$ is a sequence of nonnegative weights which may depend on n. $\theta(n)$ can be found as
the solution of the following classical linear algebra problem (e.g. [7]): Given data (or a system of
observations) on the interval $t \in [1, n]$ ($n > m$), and some set of error minimization weights, say
$\{\lambda_n(t),\ t = 1, 2, \ldots, n\}$, form the overdetermined system of equations

$$X(n)\, v = \mathbf{y}(n), \qquad (3)$$

and find the LS estimate, $\theta(n)$, for the vector $v$. $X(n)$ is the $n \times m$ matrix with $t$th row $\sqrt{\lambda_n(t)}\, x^T(t)$,
and $\mathbf{y}(n)$ is the n-vector with $t$th element $\sqrt{\lambda_n(t)}\, y(t)$. Because of this interpretation, the pair
$(y(n), x(n))$ could appropriately be called an equation in many contexts in the following. This
term is not always satisfactory, however. Whereas the term “datum” is inappropriate to describe
$(y(n), x(n))$, and “data” can be misleading, we will frequently refer to $(y(n), x(n))$ as the data set
at time n. The expression “per n” should be interpreted to mean “per data set.”

In principle, the LSE solution is the solution to the normal equations (e.g. [7]), $C(n)\theta(n) = c(n)$,
where $C(n)$ is the weighted normal matrix⁴ [8, p. 62]

$$C(n) = X^T(n) X(n) = \sum_{t=1}^{n} \lambda_n(t)\, x(t)\, x^T(t) \qquad (4)$$

and $c(n) = X^T(n)\mathbf{y}(n) = \sum_{t=1}^{n} \lambda_n(t)\, x(t)\, y(t)$.

A recursive solution can be obtained for certain classes of time varying weights. Consider first
the case in which the weights are time invariant, i.e. $\lambda_n(t)$ does not depend on n for any t. In
this case, one can use a contemporary weighted recursive least squares (WRLS) algorithm based
on the QR decomposition (e.g. [7]) of the $X(n)$ matrix of (3). We shall refer to this algorithm as
“QR-WRLS” to distinguish it from the more conventional WRLS algorithm based on the matrix
inversion lemma (e.g. [8],[9]-[11]) (MIL-WRLS)⁵. QR-WRLS, in principle, involves the application
of a sequence of orthogonal operators (Givens rotations) to (3) which leaves the system in the form

$$\begin{bmatrix} T(n) \\ 0_{(n-m) \times m} \end{bmatrix} v = \begin{bmatrix} d_1(n) \\ d_2(n) \end{bmatrix} \qquad (5)$$

where the matrix $T(n)$ is an $m \times m$ upper triangular Cholesky factor [7] of $C(n)$, i.e., $C(n) =
X^T(n)X(n) = T^T(n)T(n)$, and $0_{i \times j}$ denotes the $i \times j$ zero matrix. The system

$$T(n)\theta(n) = d_1(n) \qquad (6)$$

is easily solved using back substitution [7] to obtain the LSE estimate, $\theta(n)$. This procedure can be
performed in a recursive manner using only on the order of $m^2$ memory locations. When the $(n+1)$st data set

⁴In many contexts $C(n)$ is imprecisely called a “covariance” matrix. In fact, $\lim_{n \to \infty} (1/n)\, C(n)$ is the covariance
matrix for the process if appropriate ergodicity assumptions are made.

⁵With the exception of the parallel processing architectures, developments throughout this paper may also be
based upon MIL-WRLS. Indeed, almost all of the existing SM algorithms of the type considered here are based on
the conventional method.
becomes available, it is weighted by $\sqrt{\lambda_{n+1}(n+1)}$ and incorporated into the system. Details are found
in  [12]-[14].  We  shall  use  the  name  QR-WRLS  to  refer  to  this  form  of  the  recursion.  It  will  be 
shown  how  this  formulation  makes  possible  the  solution  of  the  ellipsoid  algorithms  to  be  described 
on  contemporary  parallel  architectures  for  great  speed  advantages.  It  also  avoids  initialization 
problems  encountered  in  the  use  of  MIL-WRLS  [14]. 
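
The recursion can be sketched in a few lines. The code below is an illustrative sketch only, not the implementation of [12]-[14]: it rotates one weighted data set into the triangular system $[T \mid d_1]$ with Givens rotations and then recovers the estimate by back substitution. The function names `qr_wrls_update` and `back_substitute`, the list-of-lists storage, and the all-zero initial system are assumptions made for this example.

```python
import math


def qr_wrls_update(T, d1, x, y, weight):
    """Rotate the weighted row [sqrt(w) x^T | sqrt(w) y] into [T | d1].

    T is an m x m upper triangular list of lists and d1 a list of length m;
    both are modified in place. The element left over after the rotations
    (it would be appended to d2) is returned.
    """
    m = len(T)
    s = math.sqrt(weight)
    row = [s * xi for xi in x]
    rhs = s * y
    for k in range(m):
        if row[k] == 0.0:
            continue  # nothing to annihilate in this column
        r = math.hypot(T[k][k], row[k])
        c, sn = T[k][k] / r, row[k] / r
        for j in range(k, m):
            t_kj, r_j = T[k][j], row[j]
            T[k][j] = c * t_kj + sn * r_j
            row[j] = -sn * t_kj + c * r_j
        d_k = d1[k]
        d1[k] = c * d_k + sn * rhs
        rhs = -sn * d_k + c * rhs
    return rhs


def back_substitute(T, d1):
    """Solve T theta = d1 for an upper triangular, nonsingular T."""
    m = len(T)
    theta = [0.0] * m
    for i in range(m - 1, -1, -1):
        acc = d1[i] - sum(T[i][j] * theta[j] for j in range(i + 1, m))
        theta[i] = acc / T[i][i]
    return theta
```

Each update touches only the m x m triangle and the m-vector $d_1$, so storage does not grow with n. Back substitution requires $T$ to be nonsingular, which in this sketch means at least m linearly independent data sets must have been entered.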

The QR-WRLS algorithm can conveniently accommodate certain classes of time varying weights
of interest in this work. The first is the case in which previous weights are scaled at time n by a
time dependent scalar,

$$\lambda_n(t) = \frac{\lambda_{n-1}(t)}{\zeta(n-1)}, \quad \forall\, t \le n-1. \qquad (7)$$

$\zeta(\cdot)$ is a scaling sequence which depends on the nature of the method. A common use for such scaling
is to effect adaptation by exponential forgetting. In this case $\zeta(n) = \alpha^{-1},\ \forall n$, where $0 < \alpha < 1$.
This scaling is conveniently carried out in the course of QR-WRLS by simply multiplying the matrix
and vector $T(n)$ and $d_1(n)$ by $\sqrt{\alpha}$ prior to considering $(y(n), x(n))$ [13]. By a straightforward
generalization of the work in [13], it can be shown that time-varying scaling may be accomplished
by a similar premultiplication by $\zeta^{-1/2}(n-1)$. Let us denote the scaled system of equations at
time $n-1$ by $T_s(n-1)\theta_s(n) = d_{1,s}(n)$.
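
A short check shows why scaling the triangular system by the inverse square root of the scaling sequence reproduces the weight scaling. The sketch assumes that the scaling rule reads $\lambda_n(t) = \lambda_{n-1}(t)/\zeta(n-1)$ and that $C(n-1) = T^T(n-1)T(n-1)$; the symbol $C_s$ for the scaled normal matrix is introduced here only for illustration.

```latex
\begin{aligned}
T_s(n-1) &= \zeta^{-1/2}(n-1)\, T(n-1) \\
C_s(n-1) &= T_s^T(n-1)\, T_s(n-1)
          = \frac{1}{\zeta(n-1)} \sum_{t=1}^{n-1} \lambda_{n-1}(t)\, x(t)\, x^T(t)
          = \sum_{t=1}^{n-1} \lambda_n(t)\, x(t)\, x^T(t)
\end{aligned}
```

For exponential forgetting, $\zeta(n) = \alpha^{-1}$, so the multiplier is $\sqrt{\alpha}$ and every earlier weight is reduced by the factor $\alpha$ at each step.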

A second type of time varying weights is used to achieve adaptation by exclusion. In this case it is
desired to remove some prior data sets from the system prior to considering $(y(n), x(n))$. Let the set
of times corresponding to data sets to be excluded be $\mathcal{T}_{n-1}$. Then, whereas $\lambda_{n-1}(t) > 0$, $t \in \mathcal{T}_{n-1}$,
it is to be true that $\lambda_n(t) = 0$, $t \in \mathcal{T}_{n-1}$. This case is accommodated within QR-WRLS by simply
reentering the data set to be forgotten with its previous weight as though it represented new data,
then making some simple sign changes in the algorithm [5],[15]. Because the data sets are removed
by “reversing” the Givens rotations which originally included them, this process is often called
back-rotation. It is notable that previous data sets can likewise be partially excluded using a similar
back-rotation method [16],[17]. After all desired data sets are removed, the system of equations is
often said to be downdated at time $n-1$, and we shall denote this by writing

$$T_d(n-1)\theta_d(n) = d_{1,d}(n). \qquad (8)$$

If it were to be solved for, $\theta_d(n)$ would represent an estimate at time $n-1$ without knowledge of
the excluded data sets.
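
In terms of the normal equations, downdating amounts to subtracting the contributions of the excluded data sets. The following lines are a sketch of that equivalence; the symbols $C_d$ and $c_d$ for the downdated normal matrix and vector are introduced here only for illustration and do not appear in the text.

```latex
\begin{aligned}
C_d(n-1) &= C(n-1) - \sum_{t \in \mathcal{T}_{n-1}} \lambda_{n-1}(t)\, x(t)\, x^T(t)
          = T_d^T(n-1)\, T_d(n-1), \\
c_d(n-1) &= c(n-1) - \sum_{t \in \mathcal{T}_{n-1}} \lambda_{n-1}(t)\, x(t)\, y(t).
\end{aligned}
```

Back-rotation obtains the triangular factor of the downdated normal matrix directly from the current triangular system, without forming either normal matrix explicitly.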


2.2  The  BE  Constraint  and  the  Feasibility  Set 

A widely researched class of SM problems comprises those involving bounded error (BE) constraints (e.g.
[3]-[6],[15]-[33]). In BE identification, a pointwise bound on the true error sequence is assumed.
Ordinarily this takes the form⁶

$$\varepsilon_*^2(n) < \gamma(n), \qquad (9)$$

where $\gamma(\cdot)$ is a known positive sequence. It follows immediately from (1) and (9) that the true
parameters must be in the set

$$\omega(n) = \left\{ \theta \;\middle|\; \left( y(n) - \theta^T x(n) \right)^2 < \gamma(n) \right\}. \qquad (10)$$

When intersected over a given time range, these sets usually form convex polytopes of feasible parameters, say
$\Omega(n) = \bigcap_{t=1}^{n} \omega(t)$. Methods which track these polytopes [3],[4],[18]-[21] result in interesting but
very complex algorithms which, at present, are not suitable for fast signal processing applications.
OBE algorithms are of much lower complexity and work with an outer bounding hyperellipsoid, a
superset of the polytope [6],[22]-[29]. The ellipsoid is “optimized” at each step by making some
measure of its size as small as possible in light of the incoming data.
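
Membership in the feasibility set is easy to test pointwise even though the set itself is hard to track. The sketch below checks whether a candidate parameter vector satisfies the bounded error constraint for a single data set and for a whole record. The function names `in_strip` and `in_feasible_set` and the representation of the data as (x, y, gamma) triples are assumptions made for this illustration.

```python
def in_strip(theta, x, y, gamma):
    """True if theta satisfies (y - theta^T x)^2 < gamma for one data set."""
    residual = y - sum(t * xi for t, xi in zip(theta, x))
    return residual * residual < gamma


def in_feasible_set(theta, data):
    """True if theta satisfies the constraint for every data set.

    `data` is an iterable of (x, y, gamma) triples, one per time step, so
    the result indicates membership in the intersection of all the sets.
    """
    return all(in_strip(theta, x, y, g) for x, y, g in data)
```

Testing a single point costs only one inner product per data set; the difficulty addressed by the polytope and ellipsoid methods is describing the whole set, not testing one candidate.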

One  of  the  drawbacks  of  the  OBE  approach  from  a  set-theoretic  point  of  view  is  that  the 
hyperellipsoidal  bounding  sets  are  sometimes  quite  “loose”  supersets  of  the  actual  feasibility  sets 
(polytopes)  (e.g.  [22],[30]).  This  problem  renders  the  resulting  feasible  superset  “pessimistic”  in 
that  it  may  contain  many  points  which  are  infeasible,  and  not  reflect  the  size  of  the  true  feasible  set. 
Whether  certain  measures  can  be  taken,  or  particular  OBE  algorithms  can  be  used,  to  minimize 
this  problem,  is  an  open  issue.  One  possible  solution  is  the  use  of  inner  bounds,  as  suggested  in 
[30], [31].  In  the  present  work  the  relative  size  of  the  bounding  set  will  turn  out  to  be  somewhat 
inconsequential.  It  is  the  information  afforded  by  the  existence  of  the  ellipsoid  which  is  important. 

2.3  Combining  the  BE  and  LSE  Problems:  The  SM-WRLS  Algorithm 

OBE  algorithms  are  fruitfully  viewed  as  a  marriage  between  the  LSE  and  BE  problems  for  LP 
models.  With  this  point  of  view,  signal  processing  engineers  have  begun  to  exploit  the  benefits  of 
BE  information  in  the  context  of  LSE  identification  problems.  In  particular,  LSE  identifiers  exploit 
no  point-by-point  information  which  can  be  used  to  ascertain  the  usefulness  of  observations.  This 
fact  manifests  itself  in  the  effective  retention  of  the  entire  parameter  space  as  a  “feasible  set,”  and 
results  in  wasteful  processing.  BE  constraints,  when  they  are  known,  provide  a  finite  feasible  set 
and  offer  the  possibility  of  including  only  data  points  which  contribute  to  the  reduction  of  this  set. 

As mentioned above, the polytope $\Omega(n)$ arising directly from BE considerations is not easy to
track and manipulate. Further, $\Omega(n)$ is not clearly related to the LSE solution. However, it has
been shown in three special cases of scaling sequences, $\zeta(\cdot)$ (recall definition below (7)), that there

⁶This form is slightly less general than stating asymmetrical amplitude bounds, $\varepsilon_{\min}(n) < \varepsilon_*(n) < \varepsilon_{\max}(n)$, but
the very slight loss of generality is worth the significant analytic gain afforded by this assumption.
is an outer bounding hyperellipsoid, say $\bar{\Omega}(n)$, which contains $\Omega(n)$ and which is closely associated
with the LSE estimate $\theta(n)$ [6],[26],[27]. A description of the hyperellipsoid is embodied in the
following:

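Proposition 1 Let $\Omega(n) \subset \mathcal{R}^m$ be the feasibility set arising from BE constraints as above. Let
$\theta(n)$ denote the weighted LSE estimate with associated normal matrix $C(n)$. The weights used in
the estimation are $\lambda_n(\cdot)$ with $\lambda_n(t) \ge 0$. There exists a hyperellipsoidal set of parameter vectors,
$\bar{\Omega}(n) \subset \mathcal{R}^m$, such that $\theta_* \in \Omega(n) \subseteq \bar{\Omega}(n)$, which is given by

$$\bar{\Omega}(n) = \left\{ \theta \;\middle|\; [\theta - \theta(n)]^T\, \Psi(n)\, [\theta - \theta(n)] < 1 \right\} \qquad (11)$$

where $\kappa(n)$ is the scalar quantity, $\kappa(n) = \theta^T(n) C(n) \theta(n) + \sum_{t=1}^{n} \gamma(t) \lambda_n(t) \left[ 1 - \gamma^{-1}(t) y^2(t) \right]$,
and $\Psi(n) = C(n)/\kappa(n)$.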

Note that the ellipsoid is centered on the LSE estimate, $\theta(n)$, and its defining matrix is a scaled
version of the normal matrix, $C(n)$.

The proof of Proposition 1 is a generalization of the proofs of similar results for special cases
(discussed below) found in [6] and [26]. Another related result for complex-valued, multiple input
- multiple output systems is proved in [16],[34].
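
The core of the argument can be outlined in a few lines. This is a sketch under the notation used above, assuming nonnegative weights with at least one positive, and it is not a substitute for the proofs in the cited references. Any parameter vector in the feasibility set satisfies every pointwise constraint, so the weighted sum of the constraints also holds; completing the square with the normal equations $C(n)\theta(n) = c(n)$ then gives the ellipsoid.

```latex
\begin{aligned}
\sum_{t=1}^{n} \lambda_n(t) \left[ y(t) - \theta^T x(t) \right]^2
  &< \sum_{t=1}^{n} \lambda_n(t)\, \gamma(t) \\
\theta^T C(n)\, \theta - 2\, \theta^T c(n) + \sum_{t=1}^{n} \lambda_n(t)\, y^2(t)
  &< \sum_{t=1}^{n} \lambda_n(t)\, \gamma(t) \\
[\theta - \theta(n)]^T C(n)\, [\theta - \theta(n)]
  &< \theta^T(n)\, C(n)\, \theta(n)
   + \sum_{t=1}^{n} \gamma(t)\, \lambda_n(t) \left[ 1 - \gamma^{-1}(t)\, y^2(t) \right]
\end{aligned}
```

The right side of the last line is the scalar that normalizes the ellipsoid in Proposition 1, and dividing through by it yields the stated form.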

Clearly, the weights $\lambda_n(\cdot)$ parameterize $\bar{\Omega}(n)$ and presumably can be chosen to control its size and
orientation in the parameter space. Because we want to work with recursive LSE estimation, in
particular QR-WRLS, let us henceforth restrict our attention to weight sequences which conform to
the simple forms of time variance described in Section 2.1 - scaling and exclusion. This effectively
restricts to one the number of free parameters available to control the bounding ellipsoid. The
central objective of an optimal bounding ellipsoid (OBE) algorithm is to employ these free weights
in the context of LSE estimation to sequentially minimize the ellipsoid size in some sense. A
significant benefit is that often no positive weight exists which reduces the ellipsoid size,
indicating that the incoming data set is uninformative in the SM sense.

In a general sense, reported (nonadaptive) OBE algorithms differ in the scaling sequences, $\zeta(\cdot)$,
used in creating time varying weights. Fogel and Huang’s original OBE algorithm (henceforth called
Fogel-Huang OBE) [6], and the more recent method by Dasgupta and Huang (henceforth called
Dasgupta-Huang OBE) [27], are not presented from this explicit LSE point of view, and this unified
approach has not been widely discussed. Some general ideas along these lines may be inferred from
[33] and a unified treatment will be found in [34]. The set membership weighted recursive least
squares (SM-WRLS) algorithm is the simplest in this sense, employing unity scaling, $\zeta(n) = 1\ \forall n$.
We henceforth focus on SM-WRLS because this absence of scaling is essential to achieve the desired
low complexity algorithm. Details of the other reported algorithms are left to the original papers
and enhancements by Belforte et al. [22], and Rao et al. [23],[24].
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Nonadaptive  SM-WRLS  (when  based  upon  QR-WRLS)  is  comprised  of  the  following  steps:  At 
time  n, 

1. In conjunction with the incoming data set $(y(n), x(n))$, find the optimal weight, say $\lambda_n^*(n)$,
which will (prospectively) minimize the size (according to some set measure) of $\Omega(n)$, say
$\mu\{\Omega(n)\}$. (This will generally require knowledge of $C(n-1)$ or $T(n-1)$, and $\kappa(n-1)$.)

2. Discard the data set if $\lambda_n^*(n) \le 0$.

3. Update $\theta(n)$ using QR-WRLS (see Section 2.1).

4. Update $\kappa(n)$ of Proposition 1 according to

$$\kappa(n) = \| d_1(n) \|^2 + \tilde{\kappa}(n) \tag{12}$$

with

$$\tilde{\kappa}(n) = \tilde{\kappa}(n-1) + \lambda_n(n)\gamma(n)\left(1 - \gamma^{-1}(n)y^2(n)\right) \tag{13}$$

where $\tilde{\kappa}(0) \triangleq 0$.

Expressions (12) and (13) are derived in [5],[15]. A detailed version of SM-WRLS is described in
[5].

2.4  Adaptation  by  Back-Rotation 

While  OBE  algorithms  in  general,  and  the  SM-WRLS  algorithm  in  particular,  have  been  shown  to 
have  inherent  and  fortuitous  adaptive  properties  as  a  result  of  their  optimal  weighting  strategies, 
measures have been suggested by Deller and Odeh [5],[15]-[17], and Norton and Mo [33] to render
adaptation explicit and controllable. All adaptive strategies for ellipsoid algorithms work on the
general  principle  of  inflating  the  “current”  ellipsoid  in  some  sense  before  considering  an  incoming 
data  set.  The  basis  for  this  inflation  is  to  contain  the  shifting  true  parameters  while  at  the  same 
time  increasing  some  measure  of  “size”  of  the  ellipsoid  (see  (16)  and  (17)  below),  making  it  more 
likely  that  the  incoming  data,  with  potentially  novel  information,  will  be  selected. 

For  SM-WRLS,  simple  forms  of  adaptation  have  been  based  upon  exponential  forgetting  and 
by  exclusion  or  back-rotation  [5],[15]-[17].  Norton  and  Mo  have  also  worked  with  exponential 
forgetting  and  other  forms  of  adaptation  in  a  broader  context  [33].  While  exponential  forgetting 
is  conveniently  integrated  into  OBE  algorithms,  in  the  following,  we  shall  focus  exclusively  upon 
adaptation methods which are based on back-rotation for two reasons: First, exponential forgetting
precludes the achievement of the low complexity algorithm ultimately sought in this work. Secondly,
due to the fact that heavily weighted points remain influential in the estimate for very long periods
of time, exponential forgetting has not been found to be as effective in tracking fast time variations
in system dynamics [16],[34]. In the case of adaptation by back-rotation, the system of equations (6)
is downdated prior to considering the data set at time n. The result is (8). The altered ellipsoid is
centered on $\theta_d(n-1)$ and has associated matrix
$C_d(n-1)/\kappa_d(n-1) = T_d^T(n-1)T_d(n-1)/\kappa_d(n-1)$.

Proper downdating of the scalar $\kappa(n-1)$ is easy. Upon rewriting the definition of $\kappa(\cdot)$ from
Proposition 1 at time n - 1,

$$\kappa(n-1) = \theta^T(n-1)C(n-1)\theta(n-1) + \sum_{t=1}^{n-1} \lambda_{n-1}(t)\gamma(t)\left(1 - \gamma^{-1}(t)y^2(t)\right) \tag{14}$$

it becomes immediately clear that if data sets at times $t \in \mathcal{F}_{n-1}$ are eliminated from the system,
then the normal matrix is simply replaced by its downdated version and all deleted terms should
be removed from the sum on the right. Correspondingly, the downdated version of (12) written for
time n - 1 becomes

$$\kappa_d(n-1) = \| d_{1,d}(n-1) \|^2 + \left[ \tilde{\kappa}(n-1) - \sum_{t \in \mathcal{F}_{n-1}} \lambda_{n-1}(t)\gamma(t)\left(1 - \gamma^{-1}(t)y^2(t)\right) \right] \tag{15}$$

and the term $\tilde{\kappa}(n-1)$ in (13) should be replaced by $\tilde{\kappa}_d(n-1)$, which is defined to be the term in
square brackets.
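
The bookkeeping of (13) and (15) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the three data triples are hypothetical, and the norm term of (12) and (15), which comes from the QR factorization, is not modeled.

```python
def kappa_tilde_update(kappa_tilde_prev, weight, gamma, y):
    """One step of recursion (13): add the term of the newly weighted data set."""
    return kappa_tilde_prev + weight * gamma * (1.0 - (y * y) / gamma)


def kappa_tilde_downdate(kappa_tilde_prev, forgotten):
    """Bracketed term of (15): remove the terms of the excluded data sets.

    `forgotten` is an iterable of (weight, gamma, y) triples, one for each
    time in the excluded set.
    """
    removed = sum(w * g * (1.0 - (y * y) / g) for (w, g, y) in forgotten)
    return kappa_tilde_prev - removed


# Hypothetical (weight, gamma, y) triples for three accepted data sets.
data = [(0.5, 1.0, 0.3), (0.2, 1.0, -1.4), (0.8, 2.0, 0.9)]
k = 0.0                                    # kappa_tilde(0) = 0
for w, g, y in data:
    k = kappa_tilde_update(k, w, g, y)
k_d = kappa_tilde_downdate(k, data[:1])    # forget the oldest data set
```

Downdating in this way should leave the same value that would have been obtained had the forgotten data set never been included.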

A  wide  range  of  adaptation  strategies  is  inherent  in  the  general  formulation  described  above, 
many  of  them  computationally  inexpensive.  We  have  found  two  forms  of  adaptation  by  back- 
rotation  to  be  particularly  effective.  These  are: 

1. Windowing. Let $l$ be a fixed “window length.” For each $n > l$, let $\mathcal{F}_{n-1} = \{n - l\}$.

2. Selective Forgetting. At time n check some predetermined criterion indicating whether adaptation
is necessary. If so, select the set to be forgotten according to some other criterion.

The first case above corresponds to the use of a sliding rectangular window of length $l$, outside of
which all previous data sets are completely removed. The estimate at time n covers the range
$[n - l + 1, n]$. The windowing technique is made possible by the ability to completely and systematically
remove data sets at the trailing edge of the window. Only one back-rotation is required prior to
optimizing at time n, and this is only necessary if $\lambda_{n-l}^*(n-l) \neq 0$.
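
As a minimal sketch of the windowing rule, the following Python function returns the set of times to back-rotate before the data set at time n is considered. The function name and the dictionary of stored weights are assumptions made for illustration; the back-rotation itself is not shown.

```python
def window_forget_set(n, window_length, weights):
    """Return the times to back-rotate before processing time n.

    `weights` maps each past time t to the weight it received (0.0 if the
    data set was rejected). Only the data set leaving the window, at time
    n - window_length, can need removal, and only if it was accepted.
    """
    t_out = n - window_length
    if t_out >= 1 and weights.get(t_out, 0.0) != 0.0:
        return [t_out]
    return []


# Hypothetical weights: times 1 and 3 were accepted, time 2 was rejected.
weights = {1: 0.4, 2: 0.0, 3: 0.7}
to_remove = window_forget_set(5, 4, weights)
```

Because rejected data sets carry zero weight, most time steps require no back-rotation at all, which is why the average number of back-rotations per n stays close to the acceptance rate.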

At significantly higher computational expense, smoother windows can be implemented by back-rotation.
This is accomplished by partial rotation of an included data set according to a schedule
which  gradually  eliminates  the  data  set  [16],[17].  Since  each  included  data  set  is  back-rotated 
multiple  times,  the  computation  required  to  effect  such  a  window  is  frequently  not  warranted  by 
the  benefits  of  slightly  improved  frequency  resolution.  For  details,  see  [16]. 


Selective forgetting represents a very general class of techniques in which the data sets to be
removed from the system are selected according to certain user defined criteria. The selection
process can be, for example, to remove (or downweight) only the previously heavily weighted data
sets, to remove the data sets that were accepted in regions of abrupt change in the signal dynamics,
or to remove the data sets starting from the first data set and proceeding sequentially. Whatever
the  criterion,  a  fundamental  issue  is  to  detect  when  adaptation  is  needed  to  improve  the  parameter 
estimates.  An  example  is  explored  in  the  simulation  studies  below. 

2.5  Optimization 

In the nonadaptive case, Fogel and Huang [6] suggest two set measures on $\Omega(n)$ for optimization.
These measures may also be applied to the downdated system extant at time n - 1 if adaptation is
employed. For generality, we assume downdating in the following. If adaptation is not used, it is
only necessary to drop subscripts “d” where they occur. The first Fogel and Huang set measure is
the determinant of the matrix $\kappa(n)C^{-1}(n)$,

$$\mu_v\{\Omega(n)\} \triangleq \det\{\kappa(n)C^{-1}(n)\} \tag{16}$$

and the second is the trace,

$$\mu_t\{\Omega(n)\} \triangleq \operatorname{tr}\{\kappa(n)C^{-1}(n)\}. \tag{17}$$

$\mu_v\{\Omega(n)\}$ is proportional to the square of the volume of $\Omega(n)$, while $\mu_t\{\Omega(n)\}$ is proportional to
the sum of the squares of its semi-axes. The following is a slightly generalized version (to accommodate
adaptation by downdating) of results found in [6],[26]. Further generalizations are found in [34].

Proposition 2 Let $\mathcal{F}_{n-1}$ be the set of times corresponding to data sets to be excluded by
back-rotation prior to time n. Then let $\lambda_n(t)$, $t \in [1, n]$, indicate the weights to be used to optimize (16)
or (17) at time n. Under the adaptation by exclusion policy, for $t \in [1, n-1]$ and $t \notin \mathcal{F}_{n-1}$,
$\lambda_n(t) = \lambda_{n-1}(t)$. For $t \in [1, n-1]$ and $t \in \mathcal{F}_{n-1}$, $\lambda_n(t) = 0$. Then,

1. if it exists, $\lambda_n^*(n)$ which minimizes the volume measure (16) is the unique positive root of the
quadratic equation

$$F_v(\lambda) = a_2\lambda^2 + a_1\lambda + a_0 = 0 \tag{18}$$

where

$$a_2 = (m-1)\gamma(n)G_d^2(n),$$
$$a_1 = \left\{(2m-1) + \gamma^{-1}(n)\varepsilon^2(n,\theta_d(n-1)) - \kappa_d(n-1)\gamma^{-1}(n)G_d(n)\right\}\gamma(n)G_d(n),$$
$$a_0 = m\left[\gamma(n) - \varepsilon^2(n,\theta_d(n-1))\right] - \kappa_d(n-1)G_d(n),$$

in which all quantities are defined above except $G_d(n) \triangleq x^T(n)C_d^{-1}(n-1)x(n)$.

2. if it exists, the weight $\lambda_n^*(n)$ which minimizes the trace measure (17) is the unique positive
root of the cubic equation

$$F_t(\lambda) = b_3\lambda^3 + b_2\lambda^2 + b_1\lambda + b_0 = 0 \tag{19}$$

with

$$b_3 = \gamma(n)G_d^2(n)\left[G_d(n) - I_d(n-1)H_d(n)\right],$$
$$b_2 = 3\gamma(n)G_d(n)\left[G_d(n) - I_d(n-1)H_d(n)\right],$$
$$b_1 = -H_d(n)G_d(n)I_d(n-1)\kappa_d(n-1) - 2H_d(n)I_d(n-1)\left[\gamma(n) - \varepsilon^2(n,\theta_d(n-1))\right] - G_d(n)\varepsilon^2(n,\theta_d(n-1)) + 3\gamma(n)G_d(n),$$
$$b_0 = \gamma(n) - \varepsilon^2(n,\theta_d(n-1)) - H_d(n)I_d(n-1)\kappa_d(n-1),$$

where $H_d(n) \triangleq x^T(n)C_d^{-2}(n-1)x(n)$ and $I_d(n) \triangleq 1/\operatorname{tr} C_d^{-1}(n)$.

For later computational considerations we note the following. In the context of QR-WRLS, the
inverse normal matrix, $C_d^{-1}(n-1)$, never appears, yet it is needed to compute $G_d(n)$ and $H_d(n)$.
The following circumvents this problem:

Lemma 1 In the context of QR-WRLS, the scalars $G_d(n)$ and $H_d(n)$ can be computed using
$O(m^2/2)$ flops.

Proof: Write

$$G_d(n) = x^T(n)T_d^{-1}(n-1)T_d^{-T}(n-1)x(n) \triangleq g^T(n)g(n) = \| g(n) \|^2 \tag{20}$$

in which $\|\cdot\|$ denotes the $l_2$ norm. Now $x(n) = T_d^T(n-1)g(n)$, and $T_d^T(n-1)$ is lower triangular,
so $g(n)$ is found by back-substitution using $(m^2 + m)/2$ flops. Now note that

$$H_d(n) = x^T(n)T_d^{-1}(n-1)T_d^{-T}(n-1)T_d^{-1}(n-1)T_d^{-T}(n-1)x(n) = g^T(n)T_d^{-T}(n-1)T_d^{-1}(n-1)g(n) \triangleq h^T(n)h(n) = \| h(n) \|^2 \tag{21}$$

and back-substitution can once again be employed. □
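
The two triangular solves in the proof of Lemma 1 can be illustrated with a short pure-Python sketch. The helper names and the 2 x 2 factor are hypothetical and chosen only so that the result can be checked by hand; explicit formation of the transpose is done for readability and would be avoided in a careful implementation.

```python
def solve_lower(L, b):
    """Forward substitution for L z = b with L lower triangular."""
    m = len(b)
    z = [0.0] * m
    for i in range(m):
        s = sum(L[i][j] * z[j] for j in range(i))
        z[i] = (b[i] - s) / L[i][i]
    return z


def solve_upper(U, b):
    """Back substitution for U z = b with U upper triangular."""
    m = len(b)
    z = [0.0] * m
    for i in reversed(range(m)):
        s = sum(U[i][j] * z[j] for j in range(i + 1, m))
        z[i] = (b[i] - s) / U[i][i]
    return z


def g_and_h(T, x):
    """Return (G, H) = (x' C^-1 x, x' C^-2 x) for C = T'T, T upper triangular."""
    m = len(x)
    T_transpose = [[T[j][i] for j in range(m)] for i in range(m)]
    g = solve_lower(T_transpose, x)   # T' g = x, as in (20)
    h = solve_upper(T, g)             # T h = g, as in (21)
    return sum(v * v for v in g), sum(v * v for v in h)


T = [[2.0, 1.0], [0.0, 3.0]]          # hypothetical triangular factor
x = [1.0, 2.0]
G, H = g_and_h(T, x)
```

For this example C = T'T = [[4, 2], [2, 10]], and direct inversion gives G = 1/2 and H = 1/18, in agreement with the two substitutions. The inverse of the normal matrix is never formed.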

3 Implementing SM-WRLS in O(m) Time

3.1 Complexity Considerations

A precise comparison of the computational loads of various OBE algorithms is given in [34]. The
number of flops (see footnote 3) required for the (generally adaptive) SM-WRLS algorithm under
consideration here may be approximated by

$$f_{opt} \approx O(c_1 m^2) + b\,O(c_2 m^2) + p\,O(c_3 m^2) \tag{22}$$

where p is the average number of data sets accepted per n; b is the average number of back-rotations
per n; and $c_1$, $c_2$ and $c_3$ are small numbers (all in the range 0.5 - 2.5) which depend upon whether
QR-WRLS or MIL-WRLS is used. For QR-WRLS, upon which we have principally focussed in this
paper, $c_1 = 0.5$, $c_2 = 2$, and $c_3 = 2.5$. The first term is due to the procedure which checks for
information in the incoming data. The others are attributable to adaptation and solution update,
respectively. If either an exponential forgetting factor or a non-unity scaling sequence (other OBE
algorithms) is used, an additional term of $O(m^2/2)$ must be added. Apparently, the SM-WRLS
algorithm, as presently formulated, is an “$O(m^2)$” process. The objective of the section below is to
demonstrate two distinct methods for reducing the effective complexity to O(m), thereby making a
SM-WRLS algorithm a desirable alternative to standard RLS-based methods from a computational
point of view.
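
To give a feel for the relative size of the terms in (22), the following sketch evaluates the approximation with the QR-WRLS constants quoted above. Treating each O(.) term as exactly c times m squared is an assumption made for illustration only, as are the function name and the choice of operating point (m = 14 and p = 0.079, with no back-rotations).

```python
def flops_per_sample(m, p, b, c1=0.5, c2=2.0, c3=2.5):
    """Approximate flops per n from (22), using the QR-WRLS constants.

    m: number of parameters; p: average acceptances per n;
    b: average back-rotations per n.
    """
    checking = c1 * m**2           # test for information in the new data
    adaptation = b * c2 * m**2     # back-rotations
    update = p * c3 * m**2         # solution update for accepted data
    return checking + adaptation + update


total = flops_per_sample(14, 0.079, 0.0)
checking_share = (0.5 * 14**2) / total
```

At this operating point the checking term accounts for roughly 70 percent of the total, which illustrates why a small acceptance rate alone does not make the algorithm inexpensive.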

Two approaches are taken. The first is an algorithmic solution which will reduce the true
complexity to O(m) for processing on a sequential machine. The second is a hardware solution
which reduces the basic algorithm to O(m) parallel complexity, with even further reduction possible
if the algorithmic measures are combined.

3.2 O(m) Processing on a Sequential Machine

From  a  signal  processing  point  of  view,  one  of  the  most  interesting  aspects  of  an  OBE  algorithm 
is  its  inherent  ability  to  select  only  data  points  which  are  informative  in  the  sense  of  refining 
the  feasible  set.  The  fact  that  typically  70  -  95%  of  the  data  are  rejected  by  this  criterion  (e.g. 
[6],[17],[23]-[29])  would  seem  to  imply  a  remarkable  savings  in  computation.  However,  this  is  only 
true  to  the  extent  that  the  checking  for  usefulness  of  the  incoming  data  set  is  negligibly  expensive 
compared  with  the  inclusion  of  it  in  the  estimate.  We  have  seen  above,  however,  that  the  checking 
procedure  is  not  inexpensive  (see  lead  term  of  (22))  -  a  point  which  has  not  been  made  clear  in 
reported research. The approach taken here is to render the checking procedure an O(m) process
in a manner which does not (empirically) degrade performance of the algorithm.

Before detailing the methods, some points about the use of the approximation “O(m)” are
necessary. The first concerns a practical matter. The objective in the following is to reduce
the computational complexity of the algorithms to an average of O(m) flops per n. It will be
appreciated that, without data buffering, the data flow is still limited by the worst case $O(m^2)$
computation. However, if a buffer is included, the algorithm can easily be structured to operate in
O(m) average time per n. Further, by using interrupt driven processing of the checking procedure,
it may be possible to reduce the average time even further. Other points concern algorithmic
details. We reiterate that the use of a unity scaling sequence (SM-WRLS algorithm) is required in
order to avoid an invariant $O(m^2/2)$ flops per n. We specifically assume the use of this algorithm
below, although the O(m) checking procedure to be developed does not depend on this choice.
Secondly, (22) indicates that an adaptive strategy must involve a sufficiently small average number
of back-rotations per n so that the $O(m^2)$ adaptation term in (22) does not overwhelm gains made
by reducing the checking cost. In the windowing case above, for example, we would expect that
$b \approx p$ and the adaptation is not unduly expensive. A selective forgetting strategy which meets this
condition will also be illustrated in the simulations below. Finally, we note that even if the checking
procedure can be made O(m), terms $b\,O(m^2)$ and $p\,O(m^2)$ (typically $b \approx p$) persist in (22). This
means that to truly achieve O(m) complexity, b and p must be O(1/m). For large m, this will
not always be the case. In fact, some experimental evidence suggests, not unexpectedly, that p
increases, rather than decreases, with increasing m. For “large” m (conservatively, say, m > 10),
therefore, it is the case that the complexity is reduced to $O(p m^2)$ by O(m) checking. It should be
clear, however, that neither O(m) nor $O(p m^2)$ complexity can be achieved if the checking procedure
remains $O(m^2)$. We therefore pursue an O(m) test for information in an incoming data set.

In principle, the information checking procedure for the volume or trace algorithms consists of
forming either $F_v(\lambda)$ or $F_t(\lambda)$ of (18) and (19), then solving for the positive root. However, since
$a_2 > 0$, and $b_i > 0$, $i = 1, 2, 3$, there is at most one such root in either case, and the test reduces to
one of checking the zero order coefficient for negativity [35]. When the test is successful, then the
root solving and updating proceeds, requiring the standard MIL- or QR-WRLS load, plus a few
operations for finding the optimal weight. In spite of Lemma 1, the most expensive aspect of this
information test is the computation of the quantity $G_d(n)$ or $H_d(n)$, each requiring $O(m^2/2)$ flops.
The trick to making the SM-WRLS algorithm an O(m) procedure is to find a way to avoid the
computation of $G_d(n)$ or $H_d(n)$ at each n. We first develop a method which accomplishes this for
the “volume” algorithm, then argue that it pertains to the “trace” optimization criterion as well.
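
For the volume criterion, the optimal check and root solving might be sketched as follows. The coefficients are those stated for (18) in Proposition 2; the function names and the numerical inputs are hypothetical, and the sketch relies on the property quoted above from [35] that a positive root exists only when the zero order coefficient is negative.

```python
import math


def volume_coefficients(m, gamma, eps_sq, kappa_d, G):
    """Coefficients (a2, a1, a0) of the quadratic (18)."""
    a2 = (m - 1) * gamma * G**2
    a1 = ((2 * m - 1) + eps_sq / gamma - kappa_d * G / gamma) * gamma * G
    a0 = m * (gamma - eps_sq) - kappa_d * G
    return a2, a1, a0


def optimal_volume_weight(m, gamma, eps_sq, kappa_d, G):
    """Return the positive root of (18), or None if the data set is rejected."""
    a2, a1, a0 = volume_coefficients(m, gamma, eps_sq, kappa_d, G)
    if a0 >= 0.0:                  # zero order coefficient not negative
        return None
    if a2 == 0.0:                  # m = 1 degenerates to a linear equation
        return -a0 / a1 if a1 > 0.0 else None
    disc = a1 * a1 - 4.0 * a2 * a0     # positive because a2 > 0 > a0
    return (-a1 + math.sqrt(disc)) / (2.0 * a2)


# Hypothetical values: m = 2, gamma = 1, squared residual 4, kappa_d = 1, G = 0.5.
lam = optimal_volume_weight(2, 1.0, 4.0, 1.0, 0.5)
```

Note that the rejection decision needs only the sign of a0, but forming a0 still requires $G_d(n)$, which is the cost that the remainder of this section seeks to avoid.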

Let us denote the estimation error vector at time n by

$$\tilde{\theta}(n) = \theta_* - \theta(n). \tag{23}$$

It follows immediately from (11) that $\tilde{\theta}^T(n)C(n)\tilde{\theta}(n) \le \kappa(n)$. While it is tempting to view $\kappa(n)$
as a bound on $\tilde{\theta}(n)$ (see discussion of the Dasgupta-Huang algorithm below), it is important to
note that each side of this inequality is dependent upon $\lambda_n(n)$. In fact, let us temporarily write
the two key quantities as functions of $\lambda_n(n)$: $C(n, \lambda_n(n))$ and $\kappa(n, \lambda_n(n))$, and consider the usual
volume quantity to be minimized at time n,

$$\mu_v\{\Omega(n)\} = \det\left\{\kappa(n, \lambda_n(n))C^{-1}(n, \lambda_n(n))\right\}. \tag{24}$$

It is assumed that enough data sets have been included in the normal matrix at time n - 1 so
that its elements are large with respect to the data in the incoming data set. For the choice of
weighting strategy employed here, the quantity $\det C(n, \lambda_n(n))$ is readily shown to be monotonically
increasing with respect to $\lambda_n(n)$ on the domain $(0, \infty)$ [16], with $C(n, 0) \triangleq C(n-1, \lambda_{n-1}^*(n-1))$.
Under the assumption above, $\det C(n, \lambda_n(n))$ will not increase significantly over reasonably small
values of $\lambda_n(n)$. The attempt to maximize $\det C(n, \lambda_n(n))$ in (24) causes a tendency to increase
$\lambda_n(n)$ in the usual optimization process. However, the attempt to minimize $\kappa(n, \lambda_n(n))$ generally
causes a tendency toward small values of $\lambda_n(n)$, unless a minimum of $\kappa(n, \lambda_n(n))$ occurs at a “large”
value of $\lambda_n(n)$. To pursue this idea and further points of the argument, we use two key facts about
$\kappa(n, \lambda_n(n))$:

Proposition 3 $\kappa(n, \lambda_n(n))$ has the following properties: 1. On the interval $\lambda_n(n) \in (0, \infty)$,
$\kappa(n, \lambda_n(n))$ is either monotonically increasing or it has a single minimum. 2. $\kappa(n, \lambda_n(n))$ has
a minimum on $\lambda_n(n) \in (0, \infty)$ iff

$$\varepsilon^2(n, \theta_d(n-1)) > \gamma(n). \tag{25}$$


To verify this result we need the following, which is proven in [34]:

Lemma 2 For n > 1, the sequence $\kappa(\cdot)$ can be computed recursively as

$$\kappa(n) = \kappa_d(n-1) + \lambda_n(n)\gamma(n) - \lambda_n(n)\frac{\varepsilon^2(n, \theta_d(n-1))}{1 + \lambda_n(n)G_d(n)}. \tag{26}$$


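Proof of Proposition 3: For simplicity, we write $\lambda_n(n)$ as $\lambda$. Using (26) from Lemma 2, we can
write

$$D(\lambda) \triangleq \frac{d\kappa(n,\lambda)}{d\lambda} = \frac{G_d^2(n)\gamma(n)\lambda^2 + 2G_d(n)\gamma(n)\lambda + \left[\gamma(n) - \varepsilon^2(n,\theta_d(n-1))\right]}{G_d^2(n)\lambda^2 + 2G_d(n)\lambda + 1} \tag{27}$$

$$Q(\lambda) \triangleq \frac{d^2\kappa(n,\lambda)}{d\lambda^2} = \frac{2\left[G_d^2(n)\lambda + G_d(n)\right]\varepsilon^2(n,\theta_d(n-1))}{\left(G_d^2(n)\lambda^2 + 2G_d(n)\lambda + 1\right)^2}. \tag{28}$$

The denominator of $D(\lambda)$ is positive on $\lambda \in (0, \infty)$, and $D(\lambda)$ therefore has a root on $\lambda \in (0, \infty)$ iff its
numerator does. The numerator is a convex parabola with its minimum at $\lambda = -1/G_d(n) < 0$,
and it therefore has a unique positive root on the interval $(0, \infty)$ iff $\gamma(n) - \varepsilon^2(n, \theta_d(n-1)) < 0$.
Further, $Q(\lambda) > 0$ for all $\lambda > 0$, so the root, if it exists, will correspond to a minimum of $\kappa(n, \lambda)$. □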
Accordingly, it can be argued that: If $\det C(n, \lambda_n(n))$ is increasing, but not changing significantly
over reasonably small values of $\lambda_n(n)$, then it is sufficient to seek $\lambda_n(n)$ which minimizes
$\kappa(n, \lambda_n(n))$. If $\kappa(n, \lambda_n(n))$ is monotonically increasing on $\lambda_n(n) > 0$, this value is $\lambda_n(n) = 0$, which
corresponds to rejection of the data set at time n. It suffices, therefore, to have a test for a minimum
of $\kappa(n, \lambda_n(n))$ on positive $\lambda_n(n)$. A simple test is embodied in condition (25), which determines
whether the square of the current residual exceeds the upcoming error bound. If this test is met,
it is then cost effective to proceed with the standard optimization centered on (18). Otherwise, the
explicit construction of $a_0$ and the solution of (18) can be avoided.
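
A minimal sketch of the suboptimal test (25), together with the recursion (26) that motivates it, is given below. The residual requires only an inner product of length m, so the check costs O(m) flops. All names and numerical values are hypothetical.

```python
def residual_squared(y, x, theta):
    """Squared prediction residual eps^2(n, theta): one inner product, O(m)."""
    prediction = sum(t * xi for t, xi in zip(theta, x))
    return (y - prediction) ** 2


def suboptimal_accept(y, x, theta, gamma):
    """Test (25): accept the data set iff eps^2 exceeds the bound gamma(n)."""
    return residual_squared(y, x, theta) > gamma


def kappa_of_lambda(lam, kappa_d, gamma, eps_sq, G):
    """kappa(n, lambda) from the recursion (26) of Lemma 2."""
    return kappa_d + lam * gamma - lam * eps_sq / (1.0 + lam * G)


theta = [0.5, -0.2]                    # hypothetical current estimate
x = [1.0, 2.0]                         # hypothetical regressor
accept_large = suboptimal_accept(2.0, x, theta, 1.0)   # residual 1.9
accept_small = suboptimal_accept(0.5, x, theta, 1.0)   # residual 0.4
```

When the test is met, a small positive weight lowers kappa below its value at zero weight; when it is not met, kappa only increases with the weight, which corresponds to rejection.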

In fact, this suboptimal test for innovation is similar to that used in the Dasgupta-Huang OBE
algorithm reported in [27]. The suboptimal test of Dasgupta is to accept the incoming data set
only if $\varepsilon^2(n, \theta(n-1)) > \gamma(n) - \kappa(n-1)$. This inequality likewise tests for a minimum of $\kappa$
with respect to $\lambda_n(n)$, and differs in form from (25) because of the scaling factors (see (7) and
surrounding discussion) which depend on the optimal weights, $\zeta(n-1) = (1 - \lambda_n^*(n))$ in the
Dasgupta case. While this dependence precludes the construction of a reasonable expression in
$\lambda_n^*(n)$ with which to minimize the set measure $\mu_v\{\Omega(n)\}$, the Dasgupta hyperellipsoid nevertheless
does have a volume at each n, and it is therefore possible to attempt to apply the above arguments.
A problem arises in the Dasgupta-Huang case, however, because the relative independence of $C(n)$
and $\lambda_n(n)$ is not tenably argued. Therefore, the simplified test in this case is not subject to the
“same” justification as (25). Interestingly, however, if $\lambda_n(n)$, which is already constrained to [0, 1)
in the Dasgupta-Huang algorithm, happens to be very small at a particular n, then the algorithm
approaches the case of unity scale factors ($\zeta(n) \approx 1$) as in SM-WRLS, and it can be argued that
the normal matrix changes only slightly. In this case, but only in this case, the arguments above
are applicable. Of course, artificially constraining the weights to be small for all n destroys the
optimization process in the Dasgupta-Huang method, so that this analysis provides support for the
suboptimal test only for isolated and infrequent times. Dasgupta and Huang argue simply that
$\kappa(n)$ is “a bound on the estimation error,” and should be minimized. This claim has been disputed
by Norton and Mo [33] and is not clearly supported here. Generally, the arguments in support of
(25) are valid only for certain types of scaling sequences which do not cause the estimation process
to “forget” too quickly. This is not generally the case with the Dasgupta-Huang strategy.

Before proceeding, another comparison to the Dasgupta-Huang OBE algorithm should be made.
One of the principal advantages of their method is the ability to conveniently prove convergence of
the ellipsoid to a point ($\theta_*$). The original Fogel and Huang paper [6] is often cited as proving that
the bounding ellipsoid in the Fogel-Huang OBE algorithm converges to a point under ordinary
conditions on $\varepsilon_*(\cdot)$. In fact, the paper only proves this convergence for the case of unity weights so
that the fundamental optimization process is not taken into account. No known proof of this desirable
result for the Fogel-Huang OBE algorithm, or for any version of SM-WRLS, exists, whether
optimal or suboptimal checking is used. While the estimate itself is guaranteed to converge asymptotically
under proper conditions on $\varepsilon_*(\cdot)$ (e.g. [10]), the ellipsoid is not guaranteed to diminish
asymptotically. However, we have found empirically that the optimal and suboptimal tests tend
to produce an ellipsoid with a similar “size” at a given point in the signal, and to produce similar
estimates, in spite of the fact that the suboptimal test tends to use fewer data (see simulations
below).

(Footnote to the Dasgupta-Huang test above: subscripts “d” are omitted there since their algorithm does not involve this form of adaptation.)

A further interpretation of (25) is possible which also allows the extension of the test to include
“trace” minimization as well. A simple rationale for the suboptimal test is as follows:

Proposition 4 If the test of (25) is met, then a positive optimal weight exists for either the volume
or trace criterion.

Proof: We show that the zero order coefficients $a_0$ and $b_0$, of (18) and (19), respectively, will never
be positive if the test is met. Consider $a_0 = m[\gamma(n) - \varepsilon^2(n, \theta_d(n-1))] - \kappa_d(n-1)G_d(n)$. Write
(11) for the downdated case at time n - 1, then multiply through by $\kappa_d(n-1)$. The result is
$[\theta_* - \theta_d(n-1)]^T C_d(n-1)[\theta_* - \theta_d(n-1)] \le \kappa_d(n-1)$. If $C_d(n-1)$ is positive definite, this implies
that $\kappa_d(n-1) \ge 0$. Further, $G_d(n) = x^T(n)C_d^{-1}(n-1)x(n) \ge 0$, so $a_0 < 0$ if the test (25) is met.
Now, consider $b_0 = \gamma(n) - \varepsilon^2(n, \theta_d(n-1)) - H_d(n)I_d(n-1)\kappa_d(n-1)$. By similarly showing that
$H_d(n) \ge 0$ and $I_d(n-1) > 0$, the desired result for the trace criterion is obtained. □

In the volume case, for example, the suboptimal check tests whether $a_0$ is negative if the term
$\kappa_d(n-1)G_d(n)$ is neglected. The ignored contribution, $-\kappa_d(n-1)G_d(n)$, is never positive and becomes
small as n increases. For a given set of preceding optimal weights, $\lambda_1^*(1), \ldots, \lambda_{n-1}^*(n-1)$, the suboptimal
test will never accept a data set which would have been rejected by the optimal test. A similar analysis
applies to the coefficient $b_0$ of the trace algorithm.

With the inexpensive test afforded by (25), the checking procedure becomes an O(m) procedure.
Consequently, for sufficiently small p, the SM-WRLS algorithm can be run in O(m) time per n.

3.3  Simulation  Studies  and  Further  Discussion 

OBE algorithms which do not include explicit adaptation measures have been demonstrated in
numerous papers cited above. Our principal objective here is to briefly illustrate the use of the
adaptive and, particularly, the O(m) suboptimal checking procedures.

We consider the identification of a time varying AR(14) model of the form

$$y(n) = \sum_{i=1}^{14} a_{i*}(n)y(n-i) + \varepsilon_*(n). \tag{29}$$

A set of “true” AR parameters was derived using linear prediction analysis (e.g. [36]) of order 14
on an utterance of the word “seven” by an adult male speaker. The original speech waveform is
shown in Fig. 1 to illustrate the time varying nature of the signal. A 7000 point sequence, $y(n)$,
was generated by driving the derived set of parameters with an uncorrelated sequence, $\varepsilon_*(n)$, which
was uniformly distributed on [-1, 1].
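
The construction of the test sequence can be sketched as follows. The speech-derived AR(14) parameters are not reproduced here, so a hypothetical slowly varying AR(2) parameter track stands in for them; the function names, the seed, and the zero initial conditions are likewise assumptions made for illustration.

```python
import random


def simulate_ar(coeffs_at, num_points, seed=0):
    """Generate y(n) = sum_i a_i(n) y(n - i) + e(n), with e(n) uniform on [-1, 1].

    `coeffs_at(n)` returns the list of AR coefficients in force at time n.
    """
    rng = random.Random(seed)
    order = len(coeffs_at(0))
    y = [0.0] * order                      # zero initial conditions
    noise = []
    for n in range(num_points):
        e = rng.uniform(-1.0, 1.0)
        a = coeffs_at(n)
        past = y[-1:-order - 1:-1]         # y(n-1), y(n-2), ..., y(n-order)
        y.append(sum(ai * yi for ai, yi in zip(a, past)) + e)
        noise.append(e)
    return y[order:], noise


def toy_coeffs(n):
    """Hypothetical slowly varying, stable AR(2) coefficients."""
    return [1.2 - 0.0001 * n, -0.5]


y, noise = simulate_ar(toy_coeffs, 7000)
```

Because the driving sequence is bounded by 1 in magnitude, the error bound is known exactly in such a simulation, which is the property that motivates the use of synthetic data in this study.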

The speech signal was not used directly in this study for a simple reason. The problem of
determining proper bounds for the model error is a nontrivial one for real speech, and a proper
description of this point would seriously sidetrack the present discussion. Similarly, space would
not permit a careful discussion of the performance of the algorithm in cases in which these bounds
are uncertain or violated. The predecessor (optimal, nonadaptive case) methods to those illustrated
here have been successfully applied to real speech and these results are reported in [26] where some
of these more difficult issues are also addressed. In the same vein, the artificial noise permits carefully
controlled statistical properties. The model noise used here is uncorrelated, and this algorithm in
its present form will converge to a biased estimate if this is not the case. A discussion of colored noise, while
interesting and useful, is beyond the scope of this paper. The interested reader is referred to
[24],[28],[34]. While the uniform distribution chosen here has become conventional in testing OBE
algorithms, it is worth noting that the performance of the methods is bound to be affected to some
extent by the choice of this distribution. This becomes clear upon recognizing that the algorithm
tends to favor the acceptance of data at time n when the residual is large. In some preliminary
runs with bounded but nonuniform distributions, we do not find these effects to be very significant.

In the simulations below, we apply the conventional and adaptive SM-WRLS algorithms with
“volume” optimization to the identification of the $a_{i*}$ parameters. We discuss several simulation
results. Only the result for $a_{4*}$ is shown in each case to conserve space. Of the 14 parameters,
$a_{4*}$ emerged as the most difficult to track and gave the worst performance. Each figure shows two
curves, one for the true parameter, the other for the estimate obtained by the algorithm under
study.

In several previous studies, it has been demonstrated that OBE algorithms have inherent
adaptive capabilities by virtue of their optimal data weighting strategies, even when not explicitly
designed to be adaptive (e.g. [24]-[29]). The adaptive capability of “nonadaptive” OBE algorithms
is somewhat unpredictable and fortuitous, especially for fast time variations. Further, they are
subject to divergence if the true parameters move outside the feasible set. Nevertheless, SM-WRLS
and other OBE algorithms often demonstrate this inherent ability. The present example is contrary.
Figure 2 illustrates the result of applying SM-WRLS to the time varying waveform. The estimate
clearly fails to appropriately track the true parameter in this case.

Before proceeding, let us use the present result to emphasize a principal point made in the paper.
The result of Fig. 2 is achieved using only the fraction p = 0.079 of the data. Other examples
are found in the literature where good tracking is achieved with similar, or even smaller, fractions
of the data used. It is important to keep in mind, however, that the computational complexity
of the SM-WRLS algorithm is only a factor of about five better than conventional RLS, and the
“p = 0.079” figure must not be interpreted to the contrary. Herein lies the motivation for the
suboptimal checking procedure.

Next, we show the simulation results of the variations on the adaptive SM-WRLS algorithm.
Figure 3 shows the results of the windowed SM-WRLS algorithm using windows of lengths 500, 1000,
and 1500. This strategy uses the fractions p = 0.221, 0.174, and 0.143 of the data, respectively, but
remains an $O(m^2)$ process because optimal checking is used. Additionally, each time an accepted
point occurs at the trailing edge of the window, a back-rotation is needed to effect adaptation. This
implies an average number of back-rotations $b \approx p$ per n (see Section 3.1). More data and more
rotations than with the unmodified SM-WRLS algorithm are used, but more accurate estimates
result and the time-varying parameters are tracked more quickly and accurately. As expected,
adapting over smaller windows tended to improve time resolution, but increased the variation of
the estimate and increased the number of points accepted. Conversely, the longer windows yielded
smoother estimates using fewer data, but at the expense of slower tracking. While no window length
in this range yielded grossly unacceptable estimates, the 1000 point window illustrated represents
a good tradeoff between the demands of time and frequency resolution.
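
The relation between accepted points and back-rotations in the windowed strategy can be illustrated with a small bookkeeping sketch. It is an illustration only, under stated assumptions: the function name `windowed_bookkeeping` and the list of acceptance flags are hypothetical, the acceptance decisions are taken as given rather than computed by SM-WRLS, and no rotations are actually performed. It shows that each accepted point is back-rotated exactly once, when it reaches the trailing edge of the window, which is why the number of back-rotations per n approaches p.

```python
from collections import deque


def windowed_bookkeeping(accept_flags, window):
    """Count back-rotations for a sliding window of the given length.

    accept_flags: iterable of booleans, one per time n (hypothetical input;
                  True means the data set at time n was accepted).
    window:       window length in samples.
    Returns (number of back-rotations, accepted points still in the window).
    """
    in_window = deque()      # times of accepted points, oldest first
    back_rotations = 0
    for n, accepted in enumerate(accept_flags):
        # An accepted point at the trailing edge must be rotated out.
        if in_window and in_window[0] == n - window:
            in_window.popleft()
            back_rotations += 1
        if accepted:
            in_window.append(n)
    return back_rotations, len(in_window)


# Toy run: four accepted points, window of three samples.
flags = [True, False, True, True, False, False, True, False]
rotations, remaining = windowed_bookkeeping(flags, window=3)
```

Every accepted point is either still inside the window or has been back-rotated, so over a long record the two counts per n converge.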

Figure  4  illustrates  the  use  of  suboptimal  checking  in  conjunction  with  windowed  SM-WRLS 
with  a  window  of  length  1000.  Interestingly,  the  fraction  of  the  data  used  is  p'  =  0.087  which  is 
about  half  that  required  in  the  same  experiment  with  optimal  data  checking  (Fig.  3(b)).  This 
means  that  the  suboptimal  checking  not  only  reduced  the  computational  effort  of  checking,  but 
also decreased by a factor of two the number of rotations required. Nevertheless,
the  estimate  trace  is  quite  similar  to  the  optimal  case,  the  only  difference  being  a  slight  increase 
in  the  variance  near  the  end  of  the  trace.  Similar  results  were  obtained  for  windows  of  length  500 
and  1500. 

The selective forgetting strategy chooses data sets to be removed from the system based on
user-defined criteria. Here the set of times to be back-rotated is as follows. Let $t' < n$ correspond to
the “oldest” data set remaining in the estimate. Then $\mathcal{J}_n = \{t', \ldots, t''\}$, where the elements in
the set are ordered, $t' < \cdots < t''$, and $t'' < n$ is the smallest time for which some other criterion
is met. The determination of when to apply the forgetting procedure and when to stop removing
data sets at a given time is discussed in the following.

The parameter $a_{4*}$ to be tracked in this study is characterized by relatively fast time variations
in the time region 2000 - 6000. The fact that the parameters change relatively slowly in the
first 2000 points induces the algorithm to accept some points which, in turn, causes the ellipsoid
volume to decrease. An increase in the “confidence” of the estimate results. Near time 2000, the
ellipsoid volume becomes very small. When the parameters move rapidly away from their current
location, they eventually move outside the ellipsoid, which is therefore no longer a valid bounding
ellipsoid. When this condition happens, it eventually leads to a negative value of $\kappa(n)$. For a
stationary system, $\kappa(n)$ is always positive, so that this condition indicates that a violation of the
theory (in particular, the violation of the assumption of stationary dynamics) has taken place*. A
similar condition was also reported by Dasgupta and Huang [27] while applying their algorithm
to nonstationary systems. In our simulation studies, we find that a negative $\kappa(n)$ is often an
effective indicator of need for adaptation, and we use this criterion as the prompt to begin selective
forgetting. Whenever accepting a data set causes $\kappa(n)$ to become negative, the algorithm starts
rotating out the selected data sets until $\kappa(n)$ becomes positive again.
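
The control flow of this trigger can be sketched as follows. This is a minimal sketch, not the authors' implementation: the names `selective_forget`, `kappa`, and `back_rotate` are hypothetical, and the toy `kappa` used in the example is an arbitrary stand-in for the actual $\kappa(n)$ computation, chosen only so that the loop has something to react to.

```python
def selective_forget(active_times, kappa, back_rotate):
    """Rotate out the oldest data sets until kappa() is positive again.

    active_times: list of times whose data sets are currently in the
                  estimate, oldest first (modified in place).
    kappa:        hypothetical callable returning the current kappa value.
    back_rotate:  hypothetical callable that removes the data set at time t
                  from the estimate.
    Returns the list of times that were removed.
    """
    removed = []
    while active_times and kappa() <= 0.0:
        t = active_times.pop(0)      # oldest remaining data set
        back_rotate(t)
        removed.append(t)
    return removed


# Toy stand-in: each retained data set lowers kappa by a fixed amount.
stress = {10: 0.8, 20: 0.5, 30: 0.2}
active = [10, 20, 30]
rotated_out = []
removed = selective_forget(
    active,
    kappa=lambda: 1.0 - sum(stress[t] for t in active),
    back_rotate=rotated_out.append,
)
```

The loop stops as soon as the indicator is positive, which matches the conservative back-rotation schedule described in the text: no more data are discarded than are needed to restore a valid ellipsoid.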

Figure 5 shows the simulation results of the selective forgetting strategy described here. The
fraction p = 0.129 of the data is accepted by the estimation procedure and about 73% of these are
back-rotated for adaptation. This implies a small “b” factor of about 0.094 per n so that adaptation
is not expensive in this case. The checking process is still $O(m^2)$, however, so the overall process
remains of $O(m^2)$ complexity. Suboptimal checking for the same experiment is illustrated in Fig.
6. In this case p' = 0.088 of the data is used with similar results. About 63% of these data are
back-rotated, so that b = 0.055. Once again, the suboptimal test has preserved the quality of the
estimate and lowered not only the checking complexity, but also the number of actual rotations
that need be implemented.
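
The quoted back-rotation rates follow directly from the accepted fraction and the percentage of accepted points that are later rotated out. The short check below reproduces that arithmetic; the function name `back_rotation_rate` is introduced here purely for illustration.

```python
def back_rotation_rate(accepted_fraction, fraction_back_rotated):
    """Average number of back-rotations per n.

    accepted_fraction:     fraction of all data sets that are accepted.
    fraction_back_rotated: fraction of the accepted sets later rotated out.
    """
    return accepted_fraction * fraction_back_rotated


# Optimal checking (Fig. 5): p = 0.129, about 73% back-rotated.
b_optimal = back_rotation_rate(0.129, 0.73)

# Suboptimal checking (Fig. 6): p' = 0.088, about 63% back-rotated.
b_suboptimal = back_rotation_rate(0.088, 0.63)
```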

Compared to the windowed adaptive strategies, for this example the selective forgetting strategy
yields smoother estimates using even fewer computations, but with poorer time resolution. (Recall
that $a_{4*}$ was found to be the most difficult to track in this simulation, so that this result is the
worst case.) In general, we have found that selective forgetting (as employed here) uses
fewer data and produces smoother estimates, but the tracking ability is not as reliable as (though
sometimes superior to) the windowed method [16], [34]. In fact, the selective forgetting strategy (as
used here) tends to outperform windowing in cases of very fast time variations in the dynamics.
The conservative schedule of back-rotations employed in the present technique accounts for this
observation. $\kappa(n) > 0$ is only a necessary condition for the true parameters to be inside the current
ellipsoid. The fact that $\kappa(\cdot)$ goes negative at a particular time does not precisely determine the point
at which system dynamics began to change. If the variations are slow, this may occur (if at all) long
after the dynamics begin to change. In fact, $\kappa(n) < 0$ often indicates a rather severe breakdown of
the process indicating that the “true” parameters have moved well outside the current ellipsoid at
time n. It is in cases of fast-changing dynamics that this “breakdown” occurs rapidly enough to render
the condition “$\kappa(n) < 0$” a good locator of changing dynamics which require “immediate” adaptation
to preserve the integrity of the process. The present example represents a very challenging case in
the sense that variations apparently occur too rapidly to be tracked by standard SM-WRLS (see
Fig. 2), yet not quickly enough to allow very high time resolution by the chosen selective forgetting
method. Other methods for selection leading to a more aggressive elimination of past data may
assist in the tracking at the expense of higher fractions of data used.

* Mathematically, $\kappa(n) < 0$ indicates an ellipsoid of negative dimensions.

4 Architectural Solutions to Achieving O(m) Time

4.1  Systolic  Architecture 

In this section we develop parallel architectures on which both suboptimal and optimal checking
versions of SM-WRLS will run in $O(m)$ time. Here the efficiency is achieved by parallelism so that
the number of operations is effectively reduced by simultaneous execution of many computations.
Accordingly, the $O(m)$ flop per n load to be achieved is actually a parallel complexity since many
processors might be performing $O(m)$ flops simultaneously. From a temporal viewpoint, the
processing is reduced from the $O(m^2)$ time required to compute the optimal solution sequentially, to
$O(m)$.

In the following we will assume the use of SM-WRLS (no scaling) for simplicity. Unlike the
sequential case, however, scaling can be done in parallel here and does not add a significant
computational burden. The modification of the following to include scaling is straightforward. We also
use ellipsoid volume minimization for optimization, but a similar machine may be developed to
implement trace optimization.

We first discuss the “nonadaptive” case. The fundamental parallel solution is made possible by
the QR-WRLS version of SM-WRLS. The main computational requirements are a GR processor
(to effectively execute the QR decomposition) to update the matrix $[T(n) \mid d_1(n)]$ at each step,
and a back substitution (BS) processor to solve for the scalar $G(n)$ and also for the estimate $\theta(n)$
at each n. Systolic processors for these operations, based on the original work of Gentleman and
Kung [37] and Kung and Leiserson [38], are well known. It is the purpose of this section to manifest
this algorithm as a parallel architecture based on these processors.

The need for implementing the algorithm on a parallel architecture arises from the fact that
portions of the algorithm are compute-bound, specifically, updating the matrix $[T(n) \mid d_1(n)]$ and
computing the value $G(n)$ and the parameter vector $\theta(n)$. The architecture that speeds up the
computation of these quantities and satisfies the desirable characteristics of systolic arrays (SA’s)
is shown in Fig. 7. Although this architecture is based on SA design methodologies, it is used
here to process one data set at a time (more on this below), and therefore is not used as an SA.
This architecture provides an improvement over that described in [39] by replacing the global
buses with local buses for communication between adjacent cells. For simplicity of notation, the
figure shows a purely autoregressive case of order three, AR(3). Once the processor is understood,
it should be clear that the architecture is perfectly capable of handling the general LP model case
discussed above. In the discussion below, the vector notations $x(n)$ and $\theta(n)$ are used; however,
the architecture of Fig. 7 uses the vectors $y(n)$ and $a(n)$ instead to denote the special case AR(3),
where $y(n) = [y(n-1)\; y(n-2)\; y(n-3)]^T$.

The architecture is composed of two SA’s, several memory management units (i.e., first-in
first-out (FIFO) and last-in first-out (LIFO) stacks†), multiply-add units (MAU’s), multiplexers
(MPX’s), and demultiplexers (DMX’s). The first SA is a triangular array that performs QR
decomposition using GR’s [37, 40], which are particularly suitable for solving recursive linear LSE
problems. The diagonal (circular) cells perform the “Givens generation” (GG) operations and all
other (square) cells in the triangular array perform the GR operations. There is a delay element
at the lower right-hand corner of the triangular array that is used to synchronize the flow of the
generated entries into the FIFO stacks and to simplify the control of these stacks once they are
filled and ready to output their contents to the BS array. The operations performed by this array
are shown in Fig. 8 [37, 40]. Therefore, the triangular array rotates the new data set into the upper
triangular matrix $[T(n) \mid d_1(n)]$, where the $t_{ij}$ cells update the matrix $T(n)$ and the right-hand
column ($d_{1j}$) cells update the vector $d_1(n)$. The element $t_{ij}$ denotes the $ij$th element of the matrix
$T(n)$ and the element $d_{1j}$ denotes the $j$th element of the vector $d_1(n)$.

The second array is a linear array that performs the BS operations shown in Fig. 9 [38]. Note
that the same BS array is used to solve for the vectors $g(n+1)$ and $\theta(n)$ with the data provided to
the appropriate cells in the required order by the FIFO and LIFO stacks. The FIFO stacks feed the
lower triangular matrix $T^T(n)$ to solve for the vector $g(n+1)$, and hence, the value $G(n+1)$. The
LIFO stacks feed the upper triangular matrix $T(n)$ to solve for the parameter vector $\theta(n)$. The
values $G(n+1) = \|g(n+1)\|^2$ and $\|d_1(n)\|^2$ are generated by the MAU’s shown in Fig. 10. The
number of segments in each stack is equal to the number of elements the stack holds. Therefore,
the leftmost stack consists of m segments, whereas the rightmost stack has only one segment.

† The architecture shown in Fig. 7 does not include any of the LIFO stacks that were used to hold the matrix
$T(n)$ in the architecture reported in [39]. This is achieved by slightly increasing the complexity of the cells used
in the triangular array so that they can be used as storage elements as well. This is facilitated by the diagonal
interconnections between adjacent cells which now constitute the LIFO stacks.

The system shown in Fig. 7 works as follows. The first m+1 data sets (with appropriate
weights) enter the triangular array (from the top) in a skewed order, and the matrix $[T(n) \mid d_1(n)]$
is generated and stored inside the cells. A shift register with appropriate feedback connection and
data sequencing can be used to hold and feed the data set to the array. The initial upper triangular
matrix residing in the array, and corresponding to the first m+1 data sets, is ready after 3m + 1
GG time cycles. The GG time cycle is that of the triangular array performing the GG operations
without square roots, which is the time required to perform five flops [40], [41]. In order to prevent
data collision, the flow of data in the triangular array moves along a corresponding wavefront and is
controlled by the slowest cells in the array, viz., GG cells. The data are fed to the array one (skewed)
data set at a time; therefore, the contents of each cell remain constant after the completion of
the current recursion. After the new data set is rotated into the matrix $[T(n) \mid d_1(n)]$, the vectors
$g(n+1)$ and $\theta(n)$ are computed. All the $t_{ij}$ cells in the triangular array load their contents on the
$t_{out}$ lines ($t_{out} \leftarrow x$), and then pass these elements across the diagonal lines ($t_{out} \leftarrow t_{in}$) (see
Figs. 7 and 8). This obviates LIFO stacks. The FIFO stacks are still needed, however, to compute
the vector $g(n+1)$. The FIFO stacks are filled with the elements of the lower triangular matrix
$T^T(n)$ as they are generated. This is done by loading the $t_{ij}$ entry on the $t_{out}$ line ($t_{out} \leftarrow x$) when
it is generated. This entry propagates down the diagonal cells (with the function $t_{out} \leftarrow t_{in}$) until
it arrives at and fills the appropriate FIFO stack. For the cells in the right-hand column, which
generate the vector $d_1(n)$, the operations are different because it is this column that constitutes
the LIFO stack for the vector $d_1(n)$. Hence, after the new data set is rotated into the array, all the
cells in the right-hand column load their contents on the $x_{out}$ lines ($x_{out} \leftarrow x$), and then they pass
these elements down the column ($x_{out} \leftarrow x_{in}$) (see Figs. 7 and 8). The output $x_{out}$ leaving the
bottom cell in this column passes through the delay element and is routed to both the MAU and
the MPX feeding the $d_{1j}$ elements to the BS array. The elements $d_{1m}$ and $t_{mm}$ leave the triangular
array at the same time because of this delay element. The timing diagram of the triangular array
is shown in Table 1. In this table, the inputs refer to the elements fed to the cells in the top row.
The circle (○) represents the GG cell and the square (□) represents the GR cell (see Fig. 7). The
outputs refer to the elements that are produced in the array cells and are written columnwise; i.e.,
the first column in the table represents the first column in the array, and so on.

The BS array is used to solve for the vectors $g(n+1)$ and $\theta(n)$. The vector $g(n+1)$ is solved
using (20) and the parameter vector $\theta(n)$ using (6). Therefore, the vector $g(n+1)$ is generated
from the matrix $T^T(n)$, which is residing in the FIFO stacks, and the vector $x(n+1)$ which is
available. The entries are fed to the BS array every other BS time cycle, where the BS time cycle
is the time required to perform one flop. As the $g_i$ entries are output from the left-end processor
of the BS array, they enter the MAU to generate the value $G(n+1)$ after 2m + 1 BS time cycles.
Likewise, the parameter vector $\theta(n)$ is generated using the matrix $T(n)$ and the vector $d_1(n)$ which
are stored in the triangular array. Starting one BS time cycle after the initiation of the first BS
operation, the appropriate entries (of the second BS operation) are also fed to the BS array every
other BS time cycle. The parameter vector $\theta(n)$ is output from the left-end processor of the BS
array in reversed order and interleaved with the vector $g(n+1)$ as shown in Fig. 7. The value
$\|d_1(n)\|^2$ is generated using a MAU one BS time cycle after the last ($m$th) element of the vector
$d_1(n)$ is generated. The timing diagram of the BS array is shown in Table 2 in which the inputs
refer to the elements fed to the shown cells, and the outputs refer to the elements produced by the
left-end processor in the array.
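
The two triangular solves carried out by the BS array can be written sequentially in a few lines. The sketch below is an assumption-laden illustration rather than a model of the array itself: equations (20) and (6) are not reproduced in this section, so it assumes they take the forms $T^T(n)\,g(n+1) = x(n+1)$ and $T(n)\,\theta(n) = d_1(n)$, which is consistent with the description of the FIFO and LIFO stacks above. The function names and the numerical values are hypothetical.

```python
def forward_substitution(lower, rhs):
    """Solve lower @ y = rhs for a lower triangular matrix (list of rows)."""
    y = []
    for i, row in enumerate(lower):
        acc = rhs[i] - sum(row[j] * y[j] for j in range(i))
        y.append(acc / row[i])
    return y


def back_substitution(upper, rhs):
    """Solve upper @ y = rhs for an upper triangular matrix (list of rows)."""
    m = len(upper)
    y = [0.0] * m
    for i in range(m - 1, -1, -1):
        acc = rhs[i] - sum(upper[i][j] * y[j] for j in range(i + 1, m))
        y[i] = acc / upper[i][i]
    return y


def transpose(matrix):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical 2-parameter example.
T = [[2.0, 1.0],
     [0.0, 3.0]]          # upper triangular T(n)
d1 = [4.0, 3.0]           # d_1(n)
x_next = [2.0, 4.0]       # x(n+1)

theta = back_substitution(T, d1)                   # assumed form of (6)
g = forward_substitution(transpose(T), x_next)     # assumed form of (20)
G = sum(gi * gi for gi in g)                       # G(n+1) = ||g(n+1)||^2
```

In the architecture these two solves share one linear array and are interleaved on alternate BS time cycles; the sequential version above only shows what is being computed.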

The values $\kappa(n)$ and $\varepsilon^2(n+1, \theta(n))$ are then computed, and hence the optimal weight for time n + 1, which
determines whether the new data set is to be accepted or not. If the new data set is accepted, then
the weighted new data set enters the triangular array and the same procedure described above takes
place, producing a new $[T(n+1) \mid d_1(n+1)]$ matrix after 2m + 1 GG time cycles, and therefore,
an updated $G(n+2)$, $\theta(n+1)$, and $\kappa(n+1)$. On the other hand, if the new data set is rejected,
then the triangular array preserves its contents (hold state), but the value $G(n+2)$ is updated to
make the decision concerning the next data set. In the latter case, the same $T^T(n+1)$ matrix is
used as the previous $T^T(n)$ matrix, and hence, the feedback on the FIFO stacks. This procedure
is repeated for every new data set.

The computational complexity (in flops per data set) for the architecture of Fig. 7 is
approximated by [16]

$$ f_{\mathrm{op.,\,parallel}} \sim O(3m) + p\,O(11m) \qquad (30) $$

where  the  first  term  accounts  for  checking  and  the  second  for  solution  update,  with  p  defined  as 
usual.  As  noted  at  the  outset,  the  complexities  of  the  solution  are  parallel  complexities  in  the 
sense  that  they  denote  the  effective  number  of  operations  per  data  set,  though  many  processors 
can  be  performing  this  number  of  operations  simultaneously.  Accordingly,  the  parallel  complexity 
indicates  the  time  it  takes  the  parallel  architecture  to  process  the  data,  regardless  of  the  total 
number  of  operations  performed  by  the  individual  cells.  The  GG  and  GR  operations  constitute  the 
main  computational  load  of  the  algorithm  as  shown  in  Table  3.  In  this  table,  the  number  of  flops 
associated  with  the  GR’s  is  multiplied  by  five  to  account  for  the  GG  cycle  time.  These  operations 
are  avoided  when  the  data  set  is  rejected,  and  thus,  a  significant  savings  in  computation  time  is 
achieved. 

Suboptimal checking may also be used in conjunction with the parallel processing. In this case
it is simply unnecessary for the processor to compute the first three items in Table 3 in order to
check the incoming data set. The reduction in computation, which is not as significant as in the
sequential processing case, is reflected by the approximation

$$ f_{\mathrm{subop.,\,parallel}} \sim O(m) + p'\,O(11m) \qquad (31) $$

for small p' [16].
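
To make the two approximations concrete, the sketch below evaluates them using the leading coefficients inside the order terms as rough per-data-set flop counts. This is an assumption made only for illustration: the order notation hides constants and lower-order terms, and the values of m, p, and p' used here are arbitrary rather than taken from the simulations. The function names are hypothetical.

```python
def flops_optimal_parallel(m, p):
    """Rough parallel load per data set with optimal checking, after (30).

    3m accounts for checking; 11m is spent only on the fraction p of
    data sets that are accepted.
    """
    return 3 * m + p * 11 * m


def flops_suboptimal_parallel(m, p_prime):
    """Rough parallel load per data set with suboptimal checking, after (31)."""
    return m + p_prime * 11 * m


# Arbitrary illustrative values (not from the reported experiments).
m = 10
load_optimal = flops_optimal_parallel(m, p=0.10)
load_suboptimal = flops_suboptimal_parallel(m, p_prime=0.05)
```

Because the checking term drops only from 3m to m, the saving from suboptimal checking is modest here, in line with the remark that the reduction is less significant than in the sequential case.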

4.1.1  Adaptive  Compact  Architecture 

The architecture described above can be modified to improve cell utilization and to incorporate
adaptation by back-rotation. The basic idea behind the compact architecture is to map the
triangular array of Fig. 7 into a linear array (called the GR array), that is, mapping all of the GG cells into
one GG cell and all the GR cells that are on the same diagonal into one GR cell. This constitutes
a permissible schedule because the projection vector, d, is parallel to the schedule vector, s, and all
the dependency arcs flow in the same direction across the hyperplanes (e.g. [42, Ch. 3]). In other
words, this schedule satisfies the conditions $s^T d > 0$ and $s^T e > 0$, for any dependence arc e.

The compact architecture implementation of the adaptive SM-WRLS algorithm is shown in
Fig. 11. The operations performed by this architecture are similar to those of Fig. 7 with the
exception that the GG and GR cells are now capable of performing back rotation (see Fig. 12) and
are embedded in slightly more complicated modules needed for scheduling. These modules are
called GG' and GR', and are shown in Fig. 13.

This architecture uses $O(m)$ cells (one GG' cell and m GR' cells) compared with $O(m^2)$ cells
(m GG cells and $(m^2 + m)/2$ GR cells) used in the architecture shown in Fig. 7, and yet has the
same computational efficiency per n. Note, however, that the LIFO stacks that were embedded in
the triangular array of Fig. 7 are now needed to hold the matrix $T(n)$.

The system shown in Fig. 11 works as follows. Each data set (with its optimal weight) enters the
GR array (from the top) in a skewed order, and the matrix $[T(n) \mid d_1(n)]$ is generated and stored in
the appropriate memory units. Note that the GR array can operate in two modes, forward ($\delta = +1$)
and backward ($\delta = -1$) rotation modes (see Fig. 12). In the backward rotation mode, the data set
to be removed is re-introduced to the GR array with the appropriate weight. At the end of each
recursion, the FIFO stacks contain the lower triangular matrix $T^T(n)$ needed to solve for the vector
$g(n+1)$, and hence, the value $G(n+1)$. The LIFO stacks contain the upper triangular matrix
$T(n)$ needed to solve for the parameter vector $\theta(n)$. The values $G(n+1) = \|g(n+1)\|^2$ and
$\|d_1(n)\|^2$ are generated by the MAU’s. Note that the values which were propagating downward
in the triangular array of Fig. 7 are now propagating leftward due to the new scheduling. Note
also that the vector $d_1(n)$ is treated differently from the matrix $T(n)$. When an element of $d_1(n)$ is
computed, it is stored in an internal register in the GR' cell (see Fig. 13). After generating and
storing the matrix $[T(n) \mid d_1(n)]$, the processor is ready to compute the vectors $g(n+1)$ and
$\theta(n)$ using the BS array. The vector $d_1(n)$ is downloaded into the latches which serve as a LIFO
stack used in conjunction with the other LIFO stacks (containing the matrix $T(n)$) to solve for
the parameter vector $\theta(n)$. The timing diagram of the GR array is shown in Table 4 in which
the input (output) columns show the elements that are input (output) to (from) the corresponding
GG (○) or GR (□) cells. Compared to the triangular array of Fig. 7, it is noted that the cell
utilization per update (or downdate) has increased by a factor of 2.25 for the case when m = 3,
or by $(0.5m^2 + 1.5m)/(m+1)$ in general. The operations and timing diagram of the BS array are
described in detail above.
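
The utilization figure can be checked against the cell counts quoted for the two architectures. The sketch below reads the improvement as the ratio of the number of cells in the triangular array to the number in the compact array, which is an interpretation adopted here for illustration; the function names are hypothetical.

```python
from fractions import Fraction


def triangular_cells(m):
    """Cells in the triangular array: m GG cells plus (m^2 + m)/2 GR cells."""
    return m + (m * m + m) // 2


def compact_cells(m):
    """Cells in the compact array: one GG' cell plus m GR' cells."""
    return m + 1


def utilization_gain(m):
    """Ratio of cell counts, equal to (0.5 m^2 + 1.5 m) / (m + 1)."""
    return Fraction(triangular_cells(m), compact_cells(m))


gain_ar3 = utilization_gain(3)   # the AR(3) case shown in the figures
```

For m = 3 the counts are 9 and 4 cells, giving the factor of 2.25 quoted in the text; the gain grows roughly as m/2 for large m.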

The adaptive compact architecture of Fig. 11 has slightly more complicated cells than that of
Fig. 7, but requires the same number of operations to check and incorporate a data set. However, the
compact architecture processor may additionally be used to back-rotate a data set for adaptation.
The forward and backward rotation modes have the same parallel complexity. Therefore, it is only
necessary to add terms of the form $b\,O(11m)$ to either (30) or (31) to account for back-rotation,
where b has the usual meaning.
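
The symmetry between the two rotation modes can be seen in a sequential sketch in which a single routine rotates a weighted data row into ($\delta = +1$) or out of ($\delta = -1$) the triangular matrix. This is a minimal sketch using a standard Givens update and hyperbolic downdate, with square roots and without the scheduling of the array; the sign convention of the residual row may differ from that of the cells in Fig. 12, and the name `rotate_row` and the test values are hypothetical.

```python
import math


def rotate_row(T, x, delta=+1):
    """Rotate row x into (delta=+1) or out of (delta=-1) the matrix T.

    T: list of m rows of an upper triangular matrix with positive diagonal;
       rows may carry extra columns, e.g. the entries of d_1(n).
    x: weighted data row with the same number of columns as T.
    Returns (updated copy of T, residual row).
    """
    T = [row[:] for row in T]
    x = list(x)
    for i in range(len(T)):
        a, b = T[i][i], x[i]
        if b == 0.0:
            continue                      # c = 1, s = 0: nothing to do
        r_squared = a * a + delta * b * b
        if r_squared <= 0.0:
            raise ValueError("back-rotation would destroy positive definiteness")
        r = math.sqrt(r_squared)
        c, s = a / r, b / r               # generated once per row (GG step)
        for j in range(i, len(x)):        # applied along the row (GR steps)
            t_old = T[i][j]
            T[i][j] = c * t_old + delta * s * x[j]
            x[j] = -s * t_old + c * x[j]
    return T, x


# Rotating a row in and then out again should restore the original matrix.
T0 = [[2.0, 1.0, 0.5, 1.0],
      [0.0, 1.5, 0.3, -0.4],
      [0.0, 0.0, 1.0, 0.7]]               # [T | d_1] with m = 3
row = [0.4, -0.2, 0.7, 0.9]
T1, _ = rotate_row(T0, row, delta=+1)
T2, _ = rotate_row(T1, row, delta=-1)
```

Both modes execute the same loop with the same operation count, which is why back-rotation contributes a term of the same form as the update term in the complexity estimates.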

5  Conclusions 

Two general contributions have been made to the theory and application of OBE algorithms for
linear-in-parameters models. We have first suggested that all reported OBE algorithms, both
nonadaptive and adaptive, can be placed into a general framework which is intimately related to
recursive LSE processing. A flexible form of explicit adaptation has been demonstrated within this
framework. In particular, a general technique based on “back-rotation” within the context of the
QR-decomposition based version of WRLS offers a flexible array of adaptation strategies and good
tracking ability. Secondly, two very different approaches to rendering a specific OBE algorithm,
SM-WRLS, of $O(m)$ per n computational complexity have been proposed. The computational
complexity of the optimal OBE algorithms is of $O(m^2)$ flops per n in spite of the highly discriminating
data selection through set-membership criteria. This fact has not been made clear in the literature.
This paper has demonstrated both an algorithmic and an architectural solution to this problem,
making the SM-WRLS method superior to many other LSE techniques in a computational sense.
In signal processing applications, this computational advantage is complemented by the existence of
the feasible set of solutions for which many other interesting purposes may be found.

Figure 1: Acoustic waveform of the utterance “seven” upon which the time-varying system
simulation studies are based.


Figure 2: “Nonadaptive” SM-WRLS algorithm applied to the estimation of parameter $a_{4*}$.
p = 0.079 of the data is used, but the estimate fails to track the true parameter.

Figure 3: Windowed SM-WRLS with optimal data checking applied to the estimation of parameter
$a_{4*}$. The window lengths are (a) 500, (b) 1000, and (c) 1500 points, and the fractions (a) p = 0.221,
(b) 0.174, and (c) 0.143 of the data are used in the estimation.

Figure 4: Windowed SM-WRLS with suboptimal data checking applied to the estimation of
parameter $a_{4*}$. The window length is 1000 points and the fraction p' = 0.087 of the data is used in
the estimation.

Figure 5: “Selective forgetting” SM-WRLS with optimal data checking applied to the estimation
of the parameter $a_{4*}$. The criterion for selective removal of past points is described in the text.
The fraction p = 0.129 of the data is used by the estimation procedure and the adaptation is
computationally very inexpensive.


Figure 6: “Selective forgetting” SM-WRLS with suboptimal data checking applied to the estimation
of the parameter $a_{4*}$. The criterion for selective removal of past points is described in the text.
The fraction p' = 0.088 of the data is used by the estimation procedure and the adaptation is
computationally very inexpensive.

Figure 8: The operations performed by the cells used in the triangular array of Fig. 7. (a) The
Givens generation (GG) cells, (b) the Givens rotation (GR) cells, and (c) the delay element.

Figure 9: Operations performed by the back substitution array. (a) The left-end processor and (b)
the multiply-add units. The initial $y_{in}$ entering the rightmost cell is set to 0.


(a) GG cell, producing (c, s):

    if (x_in = 0) {
        c = 1
        s = 0
    }
    else {
        x(n) = [x(n-1)^2 + δ (x_in)^2]^(1/2)
        c = x(n-1) / x(n)
        s = x_in / x(n)
    }

(b) GR cell, receiving and passing on (c, s), and producing x_out:

    x(n) = c x(n-1) + s x_in δ
    x_out = -s x(n-1) δ + c x_in δ

Figure 12: Operations performed by (a) the GG and (b) the GR cells used in the modules of Fig. 11.
δ = +1 (-1) for rotating the data set into (out of) the system.

Table 1: Timing diagram of the triangular array of Fig. 7.

Table 2: Timing diagram of the back substitution array of Fig. 7.
Abstract


This paper is concerned with set-membership (SM) identification, which refers to a class of algorithms
that use certain a priori knowledge about a parametric model to constrain the solutions to certain
sets. The emerging field of SM-based signal processing is receiving considerable attention and is becoming
increasingly popular around the world. This paper initially surveys the types of problems and solutions
being researched, then focuses on identification techniques of particular currency in the signal processing
field. Specifically, the case in which bounds on the model errors are known has been of particular interest
to SM researchers. We show that these "bounded error" (BE) algorithms can be combined with various
forms of least square error (LSE) signal processing algorithms with interesting and beneficial consequences.
A general framework embracing all currently used BE/LSE algorithms is developed, then strategies for
adaptation and for implementation on parallel machines are discussed. Computational complexity benefits
are considered for the various algorithms. The paper is tutorial, leaving many of the formal details to
appendices which present a theoretical treatment of the key results. These appendices serve to unify
many related results appearing in the literature.
the true parameters $\Theta_*$. Illustrated is the case in which the parameters comprise a real vector
of dimension two.

1 Approximate computational complexities in average number of cflops per data set for the
various techniques discussed in the text. m is the number of parameters in the model; k
the dimension of the output vector; p and p' represent the average number of data sets
accepted per n in the optimal and suboptimal cases, respectively (typically p' < p); and
b and b' are the average number of back-rotations performed per n in the optimal and
suboptimal cases, respectively (typically b' < b). For each sequential algorithm, scaling or
adaptation by exponential forgetting requires $0.5m^2 + (k + 0.5)m$ cflops for each procedure. If
both procedures are to be used, they can be combined and implemented at about the same
cost as a single procedure. In the parallel cases, scaling and exponential forgetting can be
achieved at virtually no cost. In the parallel processing cases, the loads in the table represent
parallel complexities (see text), and results are for the case k = 1 since architectures for the
MO case have not been developed.
System identification is concerned with the deduction of a mathematical model of a dynamical system based
on measurable signals and other attributes of the physical situation. The principal focus of this paper will
be upon the paradigm in which sufficient information exists to specify a good (“true”) parametric form
for the underlying dynamics, and the identification problem is reduced to correctly parameterizing the
mathematical form.

In a broad sense, set-membership (SM) identification is concerned with the description of sets of parameter
solutions which are consistent with the measurements and the modelling assumptions. Accordingly,
SM identification is sometimes called parameter bounding identification or similar names. The name “SM”
identification derives from the fact that an SM algorithm, in principle, ascertains whether a particular
parameter vector is a member of the feasible set.

SM identification is novel in its pursuit of set solutions rather than particular solutions which are sought
by conventional methods. A feasible set which arises as a consequence of SM processing is a reflection of
the assumptions made about the “true” model, and its “size” is inversely proportional to the amount
of information available about the “true” model. The fundamental benefit of this approach is that it
yields solutions which are based only upon tenable modelling assumptions. A set of solutions consistent
with known information can be preferable to, or complementary to, a single solution based on tenuous
assumptions. If an appropriate SM algorithm exists, it is only necessary to have sufficient modelling
information to provide a sufficiently small feasible set for a given purpose. For example, a resulting set
might be small enough so that its centroid would provide a good model in some application.

We remark that the description of an SM identification algorithm given above does not necessarily
exclude methods whose solutions are a single parameter vector. However, a consistent theory requires
that,  as  time  progresses,  feasible  sets  be  subsets  of  their  predecessors  (see  below).  The  class  of  algorithms 
that  produces  an  invariant  single  point  estimate  is  certainly  a  very  uninteresting  one.  We  will  discuss  this 
point  with  respect  to  the  conventional  least  square  error  (LSE)  solution  in  the  paper. 

After discussing the general modelling and identification issues and defining notation in Section 2, this
paper will focus on four classes of identification problems:

1. “Other” SM Problems. The principal focus of this paper is upon SM methods currently being
employed  in  signal  processing  applications.  In  Section  3  we  begin  a  taxonomy  of  SM  methods  and 
survey  problems  which  are  outside  the  scope  of  the  present  paper. 

2. The LSE Problem. The purpose of this brief discussion in Section 4 will be to view this well-known
problem  in  relation  to  the  SM  approach  in  preparation  for  further  developments. 

3. The Bounded Error (BE) Problem. We continue our discussion of the taxonomy of SM methods in
Section 5 with the BE problem. This class of SM algorithms is predicated upon a model with additive
errors whose magnitudes are assumed bounded. A vast majority of the research on SM identification
to date has focused on this problem and a variety of algorithms has resulted. The purpose here will
be to introduce the problem and review this body of research. At the “bottom” of the BE class of
techniques, we will encounter the ellipsoid bounding algorithms in which the LSE and BE problems
interface.

4. Combined LSE / BE Problem. The heart of this paper is Section 6 in which we formulate and discuss
the Unified Optimal Bounding Ellipsoid (UOBE) algorithm which represents an explicit combination
of the two classes of problems above for the linear parametric model. The UOBE algorithm is
actually a class of algorithms that embraces many LSE/BE algorithms proposed in the literature.
It will be discovered that the benefits of combining BE considerations (when they are known) with
LSE processing are twofold. First, the BE information provides a feasible set of solutions which
complements the unique LSE (“infeasible” by virtue of its uniqueness) estimate. This feasible set
can help to compensate for the extremely restrictive nature of the assumptions placed upon the LSE
model. A colored noise sequence, for example, represents a violation of the basic tenets of LSE
modelling which might be ameliorated by the BE considerations. Secondly, it will be shown how BE
knowledge can greatly improve the efficiency of LSE identification.

In its focus on Problem 4 above, this paper provides what might be called a “signal processing”
perspective on the field of SM identification. By this we mean that we approach the problem with a
predisposition toward linear models and LSE processing which are firmly entrenched and successfully
employed in many signal processing applications. The authors’ principal interest in SM theory has been
its implications for complexity improvement, architectures, adaptation, and bias-reduction in linear LSE
algorithms. The work of Fogel, Huang and colleagues [24],[41],[49],[51],[99]-[101] also falls into this realm
and this relationship will be explored in detail. A different perspective on this field is provided by the work
of a number of research groups in Europe, most of whom approach SM identification with an interest in
control and system science. These researchers, whose work will be discussed in the material to follow, have
focused on a broad array of algorithms and models, mostly in conjunction with the BE constraint. This
work has tended to focus on the development and analysis of novel, sometimes very complex, identification
algorithms for bounding feasibility sets. While extremely interesting, this work has not yet yielded methods
which are as immediately applicable to practical problems as the well known LSE approaches discussed
here^1. We review these “BE” developments in Section 5 and direct the reader to specific information about
this interesting work. Recent surveys of the SM field which focus on the BE research are found in papers

^1 This is not to say that applications have not occurred. Some examples are cited below.
by Walter and Piet-Lahanier [121], and by Milanese and Vicino [75]. Both survey papers contain extensive
and useful reference lists. The reader interested in a much lighter tutorial on a specific form of UOBE
algorithm, the “SM-WRLS” algorithm, is directed to the paper by Deller [27].

Whatever  the  particular  interest  in  pursuing  SM  algorithms,  it  is  clear  that  all  researchers  are  excited 
about  their  tremendous  potential  for  application  to  problems  of  practical  importance.  Milanese  and  Vicino 
[75],  for  example,  list  a  broad  range  of  areas  to  which  SM  techniques  have  been  applied.  Among  them  are 
applications to biology and chemistry [14],[82],[130]; pharmacokinetics [44],[73]; time series analysis [118];
economic modelling [76]; speech and image processing [29],[30],[108],[111]; ecology [58],[109]; measurement
[11],[13],[104],[110]; and robust adaptive control [2],[24],[39],[55],[60],[62],[67],[112],[117]. Recently, artificial
neural network training algorithms have been the subject of studies involving the SM algorithms [25],[50].
SM algorithms have also been explored with regard to their tracking ability for adaptive identification
[16],[17],[24],[32],[86],[87]-[89]. Finally, another novel way in which BE methods have been applied is to
the  problem  of  model  structure  identification  [114].  Because  of  this  significant  potential  for  application, 
SM  algorithms  continue  to  be  the  subject  of  intense  research  effort. 

2  Formalities 

In  this  brief  section,  we  formally  define  notation  for  the  identification  problem  to  be  studied  and  discuss 
some  important  aspects  of  the  modelling  problem. 


2.1  General  Identification  Problem 

The general modelling setup employed in the discussion is as follows: We assume that we are observing some
physical discrete time system which is generating a complex-valued vector sequence, $\mathbf{y}(\cdot)$, of dimension $k$,
in response to a complex vector-valued input $\mathbf{u}(\cdot)$. The sequence $\mathbf{u}(\cdot)$ is assumed to be a realization of an
ergodic, wide sense stationary stochastic process. Both input and output sequences are measurable. The
consideration of a complex, multiple-input, multiple-output (MIMO) system will generalize many of the
results found in the literature. Of course, the real or complex single-input, single-output (SISO) system
is contained in this analysis as a special case. Although many of the developments in the literature are
explicitly for SISO systems, these are trivially generalized to an arbitrary (finite) number of inputs (MISO).
Of the remaining developments, most are concerned with SO, but implicitly or explicitly MI, hence MISO,
systems. With regard to the dimensions of inputs and outputs, therefore, the developments here differ from
previously published results principally in terms of the generalized number of outputs. Upon occasion,
we will wish to discuss a result from the literature. In this case we shall remark that we are dealing with
a SISO or MISO system, and let $k = 1$ and the output and error (defined below) be denoted in regular
typeface, $y(\cdot)$ and $\varepsilon(\cdot)$, to denote scalars. Though SI systems do occur in such discussions, we shall not
have occasion to use the scalar notation for the input sequence.

At time $n$, a mathematical model of the form

$$\mathbf{y}(t) = \mathcal{F}[t, \Theta(n), \mathbf{y}, \mathbf{u}, \boldsymbol{\varepsilon}, p, q, r] + \boldsymbol{\varepsilon}(t, \Theta(n)) \tag{1}$$

is proposed to account for the dynamics of the physical system. For any time $t$, $\mathcal{F}$ is a $k$-vector of functions
of the “present” input $\mathbf{u}(t)$, and $p$, $q$, and $r$ lags of the sequences $\mathbf{y}(\cdot)$, $\mathbf{u}(\cdot)$, and $\boldsymbol{\varepsilon}(\cdot, \Theta(n))$, respectively.
$\mathcal{F}$ is parameterized by a matrix $\Theta(n)$, and $\boldsymbol{\varepsilon}(\cdot, \Theta(n))$ is a complex $k$-vector error sequence which depends
upon the parameterization. In all models of interest in SM analysis, the additive error sequence appears.
In general, the model will depend upon the time $n$ at which we are constructing the model (we may have
different information at different times). As we shall see below, however, the only unknown in the model
will be the parameterization. Hence, the dependence of the model upon $n$ will arise through the parameters
alone. Accordingly, we show $\Theta(n)$ as a function of the modelling time $n$.

It is assumed that a “true” time-invariant model, of form, say,

$$\mathbf{y}(t) = \mathcal{F}_*[t, \Theta_*, \mathbf{y}, \mathbf{u}, \boldsymbol{\varepsilon}_*, p_*, q_*, r_*] + \boldsymbol{\varepsilon}_*(t), \tag{2}$$

exactly accounts for the observed dynamics. While the form of $\mathcal{F}_*$ is known, the “true” parameters, $\Theta_*$,
are unknown and must be sought by the identification. Naturally, we take $\mathcal{F}$, $p$, $q$, and $r$ of the proposed
model to be equivalent to their “true” counterparts. The “true” noise sequence, $\boldsymbol{\varepsilon}_*(\cdot)$, is generally not
known on a sample-by-sample basis, but certain of its properties are known (e.g., second order statistical
properties) and are attributed to the proposed model error, $\boldsymbol{\varepsilon}(\cdot, \Theta(n))$. Whether “local” information about
the error sequence is available or not is one of the distinguishing characteristics of a SM identification
problem. Frequently, identification approaches (in particular, the LSE approach) are based on asymptotic
properties of the sequence $\boldsymbol{\varepsilon}_*(\cdot)$. Asymptotic properties fail to provide pointwise information with which
to pare down the space of parameter estimates. For example, second order statistical properties of the
sequence do not provide much specific information about the value $\boldsymbol{\varepsilon}_*(n)$ at a particular $n$. This is to be
contrasted with a SM problem, in which known attributes of the error sequence (or, infrequently, of other
aspects of the model) are available at every modelling time $n$. The following problem statement reflects
this class of constraints.

Problem 1 (General SM Identification Problem) Observations $\mathbf{y}(t)$, $\mathbf{u}(t)$, $t \in [1, n]$, are “known” to
have been generated by a “true” model of form (2) whose error sequence has a specified set of attributes,
say $A_n$, on that time range. We propose a model of form (1) with $\mathcal{F} = \mathcal{F}_*$, $p = p_*$, $q = q_*$, $r = r_*$, whose
parameters, $\Theta_*$, are unknown, but whose error sequence has properties $A_n$ on the given time range. Find
the feasible set of parameters, $\Omega(n)$, such that for each $\Theta \in \Omega(n)$, the proposed model is consistent with
the observations.
A SM problem will be said to be ill-posed unless

$$\Omega(n+1) \subseteq \Omega(n), \quad n = 1, 2, \ldots \tag{3}$$

If (3) were not true, it would be the case that there exists a potential parameterization of the “true” model
which is consistent with the observations on $t \in [1, n+1]$ but not those on $t \in [1, n]$. In turn, this implies
the potential for a time-varying “true” system, in violation of the assumption about this system. This is
an indication that there is something inconsistent in the specification of the error attributes, or that the
data do not conform to the assumed “true” model.
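
The nesting requirement on successive feasible sets can be illustrated with a minimal sketch. It assumes a hypothetical real scalar model $y(t) = \theta x(t) + \varepsilon(t)$ whose error obeys $|\varepsilon(t)| \le \gamma$, a simple instance of an error attribute that is available at every time. The function name, the bound `gamma`, and the simulated data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def feasible_intervals(y, x, gamma):
    """Running feasible interval [lo, hi] for theta after each observation.

    Assumes the scalar model y(t) = theta * x(t) + e(t) with |e(t)| <= gamma
    and x(t) != 0 (both are illustrative assumptions).
    """
    lo, hi = -np.inf, np.inf
    history = []
    for yt, xt in zip(y, x):
        # A single observation confines theta to an interval around yt / xt.
        a, b = (yt - gamma) / xt, (yt + gamma) / xt
        # The feasible set at time n is the intersection of all such intervals.
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
        history.append((lo, hi))
    return history

rng = np.random.default_rng(0)
theta_true, gamma = 0.7, 0.1
x = rng.uniform(0.5, 2.0, size=50)
e = rng.uniform(-gamma, gamma, size=50)      # error respects the assumed bound
y = theta_true * x + e

sets = feasible_intervals(y, x, gamma)
# Each feasible set is contained in its predecessor ...
nested = all(l2 >= l1 and h2 <= h1 for (l1, h1), (l2, h2) in zip(sets, sets[1:]))
# ... and every feasible set contains the "true" parameter.
contains_truth = all(l <= theta_true <= h for l, h in sets)
```

Because each new set is formed by intersection, nesting holds by construction here; if the data violated the assumed bound, the intersection could become empty, which is the kind of inconsistency described above.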

2.2  “LP”  vs.  “non-LP”  Models 

Models of form (1) can be dichotomized into those which are linear in the parameters (LP) sought,
and those which are not (non-LP). With regard to the general form (1), we see that any model in which
$\mathcal{F}$ has explicit nonlinear terms in the matrix $\Theta(n)$ is immediately non-LP. For example^2,

$$\mathbf{y}(t) = \Theta^H(n) \mathbf{A}(t) \Theta(n) + \boldsymbol{\varepsilon}(t, \Theta(n)) \tag{4}$$

where $\mathbf{A}(t)$ is some $m \times m$ matrix of functions of the lags of $\mathbf{y}(\cdot)$ and $\mathbf{u}(\cdot)$, is clearly non-LP. A model
cannot be LP, therefore, unless it can be written in the form

$$\mathbf{y}(t) = \Theta^H(n) \mathbf{x}(t) + \boldsymbol{\varepsilon}(t, \Theta(n)). \tag{5}$$

This is a necessary, but not a sufficient condition for a model to be LP, however. A second necessary
condition is that the vector sequence $\mathbf{x}(\cdot)$ contain no functions which have samples of the error sequence
$\boldsymbol{\varepsilon}(\cdot, \Theta(n))$ as arguments. One frequent occurrence of this non-LP type of mapping appears in the so-called
output error model (e.g. [52],[121]). For a SISO system^3 the output error model takes the form (5) (with
$\mathbf{y}(\cdot) = y(\cdot)$, $\mathbf{u}(\cdot) = u(\cdot)$, and $\boldsymbol{\varepsilon}(\cdot, \Theta) = \varepsilon(\cdot, \boldsymbol{\theta})$ scalars), where

$$\mathbf{x}(t) = [\hat{y}(t-1) \; \cdots \; \hat{y}(t-p) \;\; u(t) \; u(t-1) \; \cdots \; u(t-q)]^H \tag{6}$$

in which $\hat{y}(\cdot)$ represents the sequence $y(\cdot) - \varepsilon(\cdot, \boldsymbol{\theta})$. A second important non-LP model is the SISO
autoregressive - moving average with exogenous input (ARMAX) model which is of form (5) with

$$\mathbf{x}(t) = [y(t-1) \; \cdots \; y(t-p) \;\; u(t) \; u(t-1) \; \cdots \; u(t-q) \;\; \varepsilon(t-1, \boldsymbol{\theta}) \; \cdots \; \varepsilon(t-r, \boldsymbol{\theta})]^H \tag{7}$$

^2 Throughout, superscript H denotes the Hermitian transpose.

^3 Henceforth, whenever a SISO system is mentioned in the paper, it is implicit that the model signals and parameters are
real. This is for two simple reasons: 1. to avoid superfluous detail and notation, and 2. to accurately represent other research.
The general results of this paper are perfectly applicable to the complex SISO case. In the real case, “H” denotes the real
transpose.
where $r \geq 1$. The autoregressive - moving average (ARMA) model is a special case of the ARMAX with
no $u$ terms present in (7). Details on these non-LP models are found, for example, in [45],[52],[69].

LP models are characterized by difference equations of form (5) in which $\mathbf{x}(t)$ is any $m$-vector of
functions of the lags of $\mathbf{y}(\cdot)$ and $\mathbf{u}(\cdot)$ at time $t$. A special SISO case is the autoregressive with exogenous
input (ARX) model in which

$$\mathbf{x}(t) = [y(t-1) \; \cdots \; y(t-p) \;\; u(t) \; u(t-1) \; \cdots \; u(t-q)]^H. \tag{8}$$

It is conventional to denote the parameters of the ARX model by

$$\boldsymbol{\theta} = [a_1 \; \cdots \; a_p \;\; c_0 \; c_1 \; \cdots \; c_q]^H \tag{9}$$

so that the ARX system can be described in terms of the difference equation

$$y(t) = \sum_{i=1}^{p} a_i y(t-i) + \sum_{j=0}^{q} c_j u(t-j) + \varepsilon(t, \boldsymbol{\theta}). \tag{10}$$

A pure autoregressive (AR) model is a special case of the ARX model in which no $u$ terms appear.
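
As a minimal sketch of how the ARX regressor (8) and parameter vector (9) reproduce the difference equation (10), the following assumes a hypothetical real ARX system with $p = 2$, $q = 1$ and made-up coefficient values. Array indices start at zero, so the time origin is shifted relative to the text; the function name `arx_regressor` is an illustrative choice.

```python
import numpy as np

def arx_regressor(y, u, t, p, q):
    """Build x(t) = [y(t-1) ... y(t-p)  u(t) u(t-1) ... u(t-q)] as in (8)."""
    past_y = [y[t - i] for i in range(1, p + 1)]
    lagged_u = [u[t - j] for j in range(0, q + 1)]
    return np.array(past_y + lagged_u)

# Hypothetical (assumed) ARX coefficients: a_1, a_2 and c_0, c_1.
a = np.array([1.2, -0.5])
c = np.array([1.0, 0.4])
theta = np.concatenate([a, c])           # parameter vector ordered as in (9)
p, q = len(a), len(c) - 1

rng = np.random.default_rng(1)
N = 200
u = rng.standard_normal(N)               # input sequence
eps = 0.05 * rng.standard_normal(N)      # additive model error
y = np.zeros(N)
start = max(p, q)
for t in range(start, N):
    # Difference equation (10), written term by term.
    ar_part = sum(a[i - 1] * y[t - i] for i in range(1, p + 1))
    x_part = sum(c[j] * u[t - j] for j in range(q + 1))
    y[t] = ar_part + x_part + eps[t]

# The same outputs written in the LP form (5): theta^T x(t) + eps(t).
y_lp = np.array([theta @ arx_regressor(y, u, t, p, q) + eps[t]
                 for t in range(start, N)])
```

The two constructions agree to rounding error, which is all that "linear in the parameters" means for this model: the regressor holds only measured lags of the output and input.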

3  “Other”  SM  Problems 

Before turning to the main SM problems of interest, we return to the broad definition of SM identification
given in the opening paragraphs and note the potential for many other types of algorithms within the
framework of the SM algorithm definition.

A taxonomy of SM methods is shown in Fig. 1. SM techniques are seen to be first subdivided into
those concerned with bounding parameters of input-output descriptions of systems (identification), and
those dealing with bounding state estimates in state space formulations (state estimation). While it is the
former class of techniques which is treated in this paper, it is the latter which is the subject of the seminal
papers on SM theory. The reader is referred to the early papers of Schweppe [102], Witsenhausen [131],
and Bertsekas and Rhodes [15] which treat the bounding of state estimates as a consequence of bounded
errors. More recent work on the state estimation problem appears in [1],[54],[63],[66],[79],[98]. A significant
number of papers in Russian have also been published. In fact, according to Kurzhanski and Vályi [66],
some of the earliest reported work on this subject appears in the Russian paper by Krasovski [61]. For an
extensive list of papers in Russian, see [66]. While most of the work on state estimation has strong ties to
the identification methods to be discussed in the present paper, the papers by Anan'ev and Kurzhanskii [1]
and Morrell and Stirling [79] represent an interesting departure from the bounded error assumption. These
papers are concerned with bounded sets of probability distributions for a priori and a posteriori state
estimates. These constraints result in bounded sets of conditional mean estimates and error covariance
matrices.
Contemporary research into SM methods has focused to a much greater extent on the second major
subdivision concerned with bounding of parameter sets in input-output models. Most of this work has
treated the BE problem, though at least one broader class of constraints has been studied. Combettes and
Trussell [20]-[22] have rigorously investigated feasibility sets which arise as a consequence of “true”
probabilistic attributes of measurement noise. These sets are constrained to parameters which result in residuals
which are consistent with the noise properties. Noise properties considered are range, moments, and various
second and higher order properties. Combettes and Trussell [23] have also derived feasibility sets for AR
model parameters under constraints of stability and bounded (in norm) perturbations on the correlation
matrix and vector in the normal equations.

4  The  Least  Square  Error  (LSE)  Problem 

We digress momentarily from the taxonomy of SM methods to interject some material on the conventional
LSE problem. This information will be needed in the “lower levels” of the BE discussion to follow, and
will play a major role in the developments of the paper.

LSE modelling is a classic and well-understood tool for identification which has an extensive research
history quite apart from SM theory. The goals of this brief section are twofold: First, we wish to discuss
the LSE approach in relation to the SM approach in preparation for their combination in the main section
of the paper. Secondly, necessary notation for the future development will be introduced.

4.1  Relationship  Between  the  LSE  and  SM  Problems 

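The general LSE problem (for the time interval $t \in [1, n]$) is stated as follows:

Problem 2 ((Weighted) LSE Problem) Observations $\mathbf{y}(t)$, $\mathbf{u}(t)$, $t \in [1, n]$, are taken from a system
assumed to follow a “true” model of form (2). For a similar model of form (1), find the set of parameter
vectors (usually a singleton), say $\Xi(n)$, such that for each $\Theta \in \Xi(n)$, and for any parameters $\Gamma$,

$$\frac{1}{n}\sum_{t=1}^{n} \lambda_n(t) \, \| \boldsymbol{\varepsilon}(t, \Theta) \|^2 \le \frac{1}{n}\sum_{t=1}^{n} \lambda_n(t) \, \| \boldsymbol{\varepsilon}(t, \Gamma) \|^2 \tag{11}$$

where $\lambda_n(\cdot)$ is a sequence of nonnegative weights which may depend on $n$, and $\| \cdot \|$ denotes the Euclidean norm on $\mathbb{C}^k$.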

This problem resembles the form of the general SM problem, Problem 1, posed above. In particular,
the result at time $n$ appears to be a “feasible” set, $\Xi(n)$. However, $\Xi(n)$ is not a feasible set, and this is
not a proper SM problem. The differences between the LSE problem and a SM problem have been alluded
to above and are subtle and revealing.

Feasible sets of solutions in SM problems arise because of some set of attributes we ascribe to the true
model error at a given time. Any parameter vectors which can produce the given observations and an
error sequence which has these attributes is feasible. Of course, the true parameters must be an element of
any feasible set. Conspicuously missing from Problem 2 is any explicit statement of the attributes of the
“true” error $\boldsymbol{\varepsilon}_*(\cdot)$ on the range $t \in [1, n]$, which are required in an SM problem statement. In fact, we make
no such statement in the LSE problem. Implicitly, we assume that the “true” error sequence is “white
noise,” implying that asymptotically, it will have the smallest possible average squared value in light of the
observed data. We do not necessarily believe that the noise is “small” and “locally white” (on the finite
time range $t \in [1, n]$) although these are precisely the conditions which underlie (11). The justification for
using (11) is that it asymptotically leads to the “true” parameters if our assumption about $\boldsymbol{\varepsilon}_*(\cdot)$ is indeed
correct. Along the way, the sets $\Xi(n)$ (usually single points) are not generally monotonically decreasing,
and do not contain the true parameters $\Theta_*$. They are not, therefore, valid feasible sets.

Another way to view the situation above is as follows. Suppose we were to assume that (11) is a
reflection of some attribute, $A_n$, which we do believe about $\boldsymbol{\varepsilon}_*(t)$ on $t \in [1, n]$, viz.,

$$\Theta_* \text{ is such that } \frac{1}{n}\sum_{t=1}^{n} \lambda_n(t) \, \| \boldsymbol{\varepsilon}_*(t) \|^2 \text{ is minimal.} \tag{12}$$

In this case $\Xi(n)$ plays the role of a feasible set. Ordinarily, however, $\Xi(n)$ consists of a single point which
is therefore both the estimate and the true parameters. In this case since, generally, $\Xi(n+1) \not\subseteq \Xi(n)$, $n = 1, 2, \ldots$,
the SM problem is ill-posed and our belief in (12) has led to a sequence of time varying true
parameters, contrary to another belief about the system.

However one views the situation, the conclusion is that the LSE problem is not a valid SM problem. The
basic deficiency is the absence of any useful information which serves to constrain the feasible parameters
in finite time. In fact, no attributes are assigned to the “true” error on a finite time basis, and the resulting
“feasible set” is $\Omega(n) = \mathcal{R}^m$ for any $n < \infty$. This results in the necessity of incorporating all data into a
LSE estimate, since there is no point-by-point or finite range basis for doing otherwise. Incorporating the BE
considerations below will largely remedy this inadequacy of LSE processing.

Because  we  intend  to  blend  the  LSE  and  BE  theory  below,  and  also  because  LSE  processing  will  emerge 
as  central  to  another  important  technique  to  be  discussed,  it  is  important  to  lay  down  a  formal  framework 
for  LSE  identification. 

4.2  LSE  Problem:  Formalities 

Our  discussions  of  LSE  processing  will  focus  exclusively  upon  models  which  are  LP.  The  objective  here  is 
to lay the formal foundation for these future discussions. Much of the formality described here represents
a  generalization  of  developments  appearing  in  the  literature. 

With reference to Problem 2 and surrounding discussion, we assume the existence of a “true” model of
form

$$\mathbf{y}(t) = \Theta_*^H \mathbf{x}(t) + \boldsymbol{\varepsilon}_*(t) \tag{13}$$

in which $\mathbf{x}(t)$ is some $m$-vector of functions of $p_*$ lags of $\mathbf{y}(\cdot)$ and $q_*$ lags plus the present value of $\mathbf{u}(\cdot)$,
and where, in accordance with the discussion immediately above, $\boldsymbol{\varepsilon}_*(\cdot)$ is the realization of a zero-mean,
second moment ergodic, vector-valued random sequence whose components are independent:

$$E\{\boldsymbol{\varepsilon}_*(t-i)\,\boldsymbol{\varepsilon}_*^H(t-j)\} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \boldsymbol{\varepsilon}_*(t-i)\,\boldsymbol{\varepsilon}_*^H(t-j) = \sigma^2 \delta(i-j)\, I \tag{14}$$

where $E\{\cdot\}$ denotes the expectation, $\sigma^2$ is some finite constant, $\delta(\cdot)$ is the Kronecker delta sequence (e.g.
[45, p. 37]) and $I$ is the $k \times k$ identity matrix. No finite time attributes are ascribed to $\boldsymbol{\varepsilon}_*(\cdot)$. At time $n$,
we wish to use the observed data on $t \in [1, n]$ to deduce an estimated model of the form (5),

$$\mathbf{y}(t) = \Theta^H(n)\,\mathbf{x}(t) + \boldsymbol{\varepsilon}(t, \Theta(n)). \tag{15}$$

For the LP problem, the identified parameter vector will be unique for each $n$ (e.g. [45],[52],[69]), but will
generally change at every step. Hence, the index $n$ is very significant. In particular, we desire the weighted
LSE model for which $\Theta(n)$ satisfies (11).

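$\Theta(n)$ can be found as the solution of the following classical linear algebra problem [46]: Given data (or
a system of observations) on the interval $t \in [1, n]$ ($n > m$), and some set of error minimization weights,
say $\lambda_n(\cdot)$, form the overdetermined system of equations

$$\begin{bmatrix} \sqrt{\lambda_n(1)}\,\mathbf{x}^H(1) \\ \sqrt{\lambda_n(2)}\,\mathbf{x}^H(2) \\ \vdots \\ \sqrt{\lambda_n(n)}\,\mathbf{x}^H(n) \end{bmatrix} \Gamma = \begin{bmatrix} \sqrt{\lambda_n(1)}\,\mathbf{y}^H(1) \\ \sqrt{\lambda_n(2)}\,\mathbf{y}^H(2) \\ \vdots \\ \sqrt{\lambda_n(n)}\,\mathbf{y}^H(n) \end{bmatrix} \tag{16}$$

denoted

$$X(n)\,\Gamma = Y(n), \tag{17}$$

and find the LS estimate, $\Theta(n)$, for the vector $\Gamma$. Because of this interpretation, the pair $(\mathbf{y}(t), \mathbf{x}(t))$
could appropriately be called an equation in many contexts in the following. This term is not always
satisfactory, however. Whereas the term “datum” is inappropriate to describe $(\mathbf{y}(t), \mathbf{x}(t))$, and “data” can
be misleading, we will frequently refer to $(\mathbf{y}(t), \mathbf{x}(t))$ as the data set at time $t$. The expression “per $t$”
should be interpreted to mean “per data set.”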

There are well known methods to solve this problem. The first is the “batch” solution given by [46]

$$\Theta(n) = \left[ X^H(n) X(n) \right]^{-1} X^H(n) Y(n) \tag{18}$$
with the matrix in brackets playing the role of the weighted covariance matrix^4, i.e.,

$$C(n) = X^H(n) X(n) = \sum_{t=1}^{n} \lambda_n(t)\, \mathbf{x}(t)\, \mathbf{x}^H(t). \tag{19}$$

For future reference, we also note that the “auxiliary matrix” on the right side of (18) can be expressed as

$$C_{xy}(n) = X^H(n) Y(n) = \sum_{t=1}^{n} \lambda_n(t)\, \mathbf{x}(t)\, \mathbf{y}^H(t). \tag{20}$$

When written explicitly in the form

$$C(n)\,\Theta(n) = C_{xy}(n) \tag{21}$$

this equation is frequently referred to as the set of normal equations.
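
A minimal numerical sketch of the batch solution follows. It assumes hypothetical sizes ($n = 40$ data sets, $m = 3$ parameters, $k = 2$ outputs), randomly generated complex data, and arbitrary positive weights; none of these values come from the paper. It forms $X(n)$ and $Y(n)$ row by row as in (16), solves the normal equations (21), and cross-checks the result against a generic least-squares solver and against the sum form of $C(n)$ in (19).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 40, 3, 2                      # hypothetical sizes: data sets, parameters, outputs

# Hypothetical "true" parameters and data for the LP model y(t) = Theta^H x(t) + eps(t).
Theta_true = rng.standard_normal((m, k)) + 1j * rng.standard_normal((m, k))
x = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))       # row t holds x(t)^T
eps = 0.01 * (rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k)))
y = x @ Theta_true.conj() + eps                                          # row t holds y(t)^T
lam = rng.uniform(0.5, 1.5, size=n)                                      # nonnegative weights

# Overdetermined system X Gamma = Y: row t is sqrt(lam(t)) x^H(t), resp. sqrt(lam(t)) y^H(t).
X = np.sqrt(lam)[:, None] * x.conj()
Y = np.sqrt(lam)[:, None] * y.conj()

# Normal equations C(n) Theta(n) = C_xy(n).
C = X.conj().T @ X
C_xy = X.conj().T @ Y
Theta_hat = np.linalg.solve(C, C_xy)

# Cross-checks: the sum form of C(n), and a generic least-squares solver.
C_sum = sum(lam[t] * np.outer(x[t], x[t].conj()) for t in range(n))
Theta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience of this sketch, not a statement about how the paper's algorithms are implemented.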

When the weights are time varying by virtue of time-dependent scaling of previous weights at time $n$,
i.e.,

$$\lambda_n(t) = \frac{\lambda_{n-1}(t)}{\zeta(n-1)}, \quad \forall t \le n-1, \tag{22}$$

where $\zeta(\cdot)$ is a time dependent normalizing sequence, then the weighted LSE solution can be computed
recursively using the relations [81]

$$C_s(n) = C(n)/\zeta(n) \tag{23}$$

$$C^{-1}(n) = C_s^{-1}(n-1) - \lambda_n(n)\, \frac{C_s^{-1}(n-1)\,\mathbf{x}(n)\,\mathbf{x}^H(n)\,C_s^{-1}(n-1)}{1 + \lambda_n(n)\, G_s(n)} \tag{24}$$

$$\Theta(n) = \Theta(n-1) + \lambda_n(n)\, C^{-1}(n)\, \mathbf{x}(n)\, \boldsymbol{\varepsilon}^H(n, \Theta(n-1)) \tag{25}$$

where $G_s(n) = \mathbf{x}^H(n)\, C_s^{-1}(n-1)\, \mathbf{x}(n)$. For future reference, let us also define the “unscaled” version of this
last quantity, $G(n) = G_s(n)/\zeta(n-1) = \mathbf{x}^H(n)\, C^{-1}(n-1)\, \mathbf{x}(n)$. (In general, quantities with subscripts “s” will
indicate that the scale factor is included, and those without such subscripts are the unscaled counterparts.)
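
The recursions (23) - (25) can be sketched as follows. The example assumes a constant scaling sequence $\zeta(n) = 1/\alpha$ with a hypothetical $\alpha = 0.95$, unit weight on each new data set, randomly generated data, and initialization from a short batch solution over the first `n0` data sets; the function names `mil_wrls_step` and `batch` and all numerical values are illustrative assumptions. The recursive result is compared with the batch solution of the normal equations computed with the equivalent final weights.

```python
import numpy as np

def mil_wrls_step(C_inv, Theta, x, y, lam, zeta_prev):
    """One pass through (23)-(25).

    C_inv     : C^{-1}(n-1), shape (m, m)
    Theta     : Theta(n-1), shape (m, k)
    x, y      : x(n) with shape (m,), y(n) with shape (k,)
    lam       : lambda_n(n), the weight on the new data set
    zeta_prev : zeta(n-1), the scale applied to all earlier weights
    """
    Cs_inv = zeta_prev * C_inv                    # inverse of C_s(n-1) = C(n-1)/zeta(n-1)
    Cs_x = Cs_inv @ x
    G_s = np.real(x.conj() @ Cs_x)                # G_s(n) = x^H C_s^{-1}(n-1) x
    C_inv_new = Cs_inv - lam * np.outer(Cs_x, x.conj() @ Cs_inv) / (1.0 + lam * G_s)  # (24)
    eps = y - Theta.conj().T @ x                  # eps(n, Theta(n-1))
    Theta_new = Theta + lam * np.outer(C_inv_new @ x, eps.conj())                     # (25)
    return C_inv_new, Theta_new

def batch(xs, ys, lam):
    """Solve the normal equations for weights lam; returns (C^{-1}, Theta)."""
    C = (xs.T * lam) @ xs.conj()                  # sum_t lam(t) x(t) x^H(t)
    C_xy = (xs.T * lam) @ ys.conj()               # sum_t lam(t) x(t) y^H(t)
    return np.linalg.inv(C), np.linalg.solve(C, C_xy)

rng = np.random.default_rng(3)
N, m, k, n0, alpha = 60, 3, 2, 5, 0.95            # hypothetical sizes and scaling constant
Theta_true = rng.standard_normal((m, k)) + 1j * rng.standard_normal((m, k))
xs = rng.standard_normal((N, m)) + 1j * rng.standard_normal((N, m))
ys = xs @ Theta_true.conj() + 0.01 * rng.standard_normal((N, k))

# Initialize from a batch solution on the first n0 data sets (unit weights).
C_inv, Theta = batch(xs[:n0], ys[:n0], np.ones(n0))
for n in range(n0, N):
    C_inv, Theta = mil_wrls_step(C_inv, Theta, xs[n], ys[n], lam=1.0, zeta_prev=1.0 / alpha)

# Equivalent final weights: each earlier weight was multiplied by alpha at every later step.
lam_final = alpha ** (N - 1 - np.maximum(np.arange(N), n0 - 1))
C_inv_batch, Theta_batch = batch(xs, ys, lam_final)
```

Initializing from a small batch solve avoids having to choose $C^{-1}(0)$; other initializations are possible and are not addressed by this sketch.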

As an aside, we note that the scaling sequences $\zeta(\cdot)$ will play a key role in the developments to follow.
One peculiarity will occur with regard to this sequence in a very important SM algorithm. In this case
$\zeta(\cdot)$ will be such that, for each $n$, $\zeta(n-1)$ depends on a quantity which will not be computed until time
$n$. In general, we shall distinguish between “causal” and “noncausal” scaling sequences. A causal scaling
sequence $\zeta(\cdot)$ is one for which, for every $n$, and for all $n' > n$, $\zeta(n)$ is independent of any quantity which
is not computed until time $n'$. In simple terms, a causal scaling sequence is one which does not depend
on “future” processing to determine its “present” values. If $\zeta(\cdot)$ is not causal, then it is noncausal. It
might seem that a noncausal sequence would be all but impossible to work with, but, as noted, we shall
encounter one interesting case to the contrary.


⁴ More precisely, this is a normal matrix which becomes a "covariance" matrix asymptotically if scaled by 1/n, and if the
mean of the vector x(t) is zero for all t. We shall use the conventional term "covariance" in this work.
When the scaling sequence $\zeta(\cdot)$ is unity for all time, (24) and (25) are usually called recursive
least squares (RLS) (e.g. see [69],[81]) or sequential least squares (SLS) (e.g. see [45]), and the word
"weighted" is sometimes added to give WRLS or WSLS. When the scaling factor is taken to be constant,
say $\zeta(n) = 1/\alpha$ with $0 < \alpha < 1$, then $\alpha$ is called a forgetting factor (FF), and an acronym like
"SLSFF" might be used. In any case, we will use the acronym "WRLS" to refer to a recursive computation
of the weighted LSE solution, and in particular we will call (23) - (25) MIL-WRLS to indicate recursions
based on the matrix inversion lemma (MIL) [45],[69],[81]. This is to be juxtaposed with QR-WRLS
described in the following paragraph.
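To see why a constant scale factor acts as a forgetting factor, the scaling rule (22) can be unrolled from the time t at which a data set arrives up to the current time n. The following worked step assumes only (22) and the constant choice $\zeta(j) = 1/\alpha$; $\lambda_t(t)$ denotes the weight assigned when the data set first arrives.

```latex
\lambda_n(t)
  = \frac{\lambda_{n-1}(t)}{\zeta(n-1)}
  = \lambda_t(t) \prod_{j=t}^{n-1} \frac{1}{\zeta(j)}
  = \alpha^{\,n-t}\, \lambda_t(t),
  \qquad \zeta(j) \equiv \frac{1}{\alpha}, \quad t \le n .
```

Each past data set is therefore discounted geometrically in its age n - t, which is the usual exponential window.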

When the weights conform to (22), one can use a contemporary WRLS algorithm based on the QR
decomposition of the X(n) matrix of (17) [28],[33],[43],[48],[71],[72]. The procedure, in principle, involves
the application of a sequence of orthogonal operators (Givens rotations) to (17) which leaves the system
in the form

$$\begin{bmatrix} T(n) \\ 0_{(n-m) \times m} \end{bmatrix} \Theta(n) = \begin{bmatrix} D_1(n) \\ D_2(n) \end{bmatrix} \qquad (26)$$

where the matrix T(n) is an m x m upper triangular Cholesky factor [46] of C(n) (see (27) below), and
$0_{i \times j}$ denotes the i x j zero matrix. $D_1(n)$ and $D_2(n)$ are m x k and (n - m) x k matrices, respectively,
which result from the operations on Y(n). It will be useful in our work below to note that

$$C(n) = X^H(n) X(n) = T^H(n) T(n) \qquad (27)$$

because T(n) results from an orthogonal transformation of X(n). The system

$$T(n)\, \Theta(n) = D_1(n) \qquad (28)$$

is easily solved using back substitution [46] (k times, once for each column of $\Theta(n)$ and $D_1(n)$) to obtain
the LSE estimate, $\Theta(n)$. When the (n + 1)st data set becomes available, it is weighted by $[\lambda_{n+1}(n+1)]^{1/2}$
and the matrices T(n) and $D_1(n)$ are scaled by $[\zeta(n)]^{-1/2}$ before incorporating this new information. This
procedure can be performed in a recursive manner using only about $m^2 + km$ memory locations. Details
for the SISO case (which are easily generalized) are found in [28],[33],[48],[71]. We shall use the name
QR-WRLS to refer to this form of the recursion. This formulation makes possible the solution of the
ellipsoid algorithms to be described on contemporary parallel architectures (discussed in Section 6.7) for
great speed advantages.
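The QR-based computation can be sketched as follows. This is only an illustration under several assumptions: NumPy's `numpy.linalg.qr` (which uses Householder reflections) stands in for the sequence of Givens rotations, the rows of X(n) and Y(n) are taken to be $\sqrt{\lambda_n(t)}\, x^H(t)$ and $\sqrt{\lambda_n(t)}\, y^H(t)$, and the helper names `back_substitute`, `qr_factor`, and `qr_wrls_update` are hypothetical.

```python
import numpy as np

def back_substitute(T, D1):
    """Solve the upper triangular system T Theta = D1 of (28), one row at a time."""
    m = T.shape[0]
    theta = np.zeros_like(D1)
    for i in range(m - 1, -1, -1):
        theta[i] = (D1[i] - T[i, i + 1:] @ theta[i + 1:]) / T[i, i]
    return theta

def qr_factor(X, Y):
    """Reduce X Theta = Y to the triangular form (26); returns T(n) and D_1(n)."""
    Q, T = np.linalg.qr(X, mode="reduced")     # X = Q T with T upper triangular
    return T, Q.conj().T @ Y

def qr_wrls_update(T, D1, x_new, y_new, lam_new, zeta=1.0):
    """Fold one new data set into (T, D1) by re-triangularizing m + 1 rows."""
    s = 1.0 / np.sqrt(zeta)                    # scaling applied to the old factors
    w = np.sqrt(lam_new)                       # weighting applied to the new row
    A = np.vstack([s * T, w * x_new.conj()[None, :]])
    B = np.vstack([s * D1, w * y_new.conj()[None, :]])
    return qr_factor(A, B)

# Assumed synthetic data, unscaled weights (zeta = 1)
rng = np.random.default_rng(2)
N, m, k = 30, 3, 2
x = rng.standard_normal((N, m)) + 1j * rng.standard_normal((N, m))
y = rng.standard_normal((N, k)) + 1j * rng.standard_normal((N, k))
lam = rng.uniform(0.2, 1.0, size=N)

X = np.sqrt(lam)[:, None] * x.conj()
Y = np.sqrt(lam)[:, None] * y.conj()
T, D1 = qr_factor(X[:10], Y[:10])
for n in range(10, N):
    T, D1 = qr_wrls_update(T, D1, x[n], y[n], lam[n])
theta_qr = back_substitute(T, D1)

C = X.conj().T @ X                             # covariance of (19)
theta_ne = np.linalg.solve(C, X.conj().T @ Y)  # normal-equation solution (21)
```

Only the m x m factor T and the m x k block D_1 are carried between updates, which is where the storage figure quoted above comes from. The factor returned by NumPy may have negative diagonal entries, but (27) holds regardless.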
5 The Bounded Error (BE) Identification Problem


5.1  Overview 

We now return to Fig. 1 and the survey of SM methods and discuss the most widely researched group of
techniques, those based on a BE constraint. The general problem statement is as follows:

Problem 3 (BE Identification Problem) Observations y(t), u(t), $t \in [1, n]$ are "known" to have been
generated by a "true" model of form (3) whose error sequence is⁵ "pointwise energy bounded":

$$\| \varepsilon_*(n) \|^2 < \gamma(n), \qquad (29)$$

where $\gamma(\cdot)$ is a known positive sequence⁶. We propose a model of form (1) with $\mathcal{P} = \mathcal{P}_*$, $p = p_*$,
$q = q_*$, $r = r_*$, whose parameters, $\Theta_*$, are unknown, but whose error sequence adheres to (29) at time n. Find
the feasible set of parameters, $\Omega(n)$, such that for each $\Theta \in \Omega(n)$, the proposed model is consistent with
the observations.

BE methods are categorized into those for which the models are LP and those for which they are non-LP (see
Section 2.2). The feasible solution sets that arise as a consequence of error bounding assume different
geometries in the parameter space, depending on the form of the model. In general, constraints of form
(29), in conjunction with a model of form (1) and the measured data, imply pointwise feasible sets

$$\omega(n) = \left\{ \Theta \;\middle|\; \| y(n) - \mathcal{P}[n, \Theta, y, u, \varepsilon, p, q, r] \|^2 < \gamma(n) \right\}. \qquad (30)$$

These can be intersected over time to create a feasible set over the range $t \in [1, n]$,

$$\Omega(n) = \bigcap_{t=1}^{n} \omega(t). \qquad (31)$$

For a non-LP model the "local" $\omega(n)$ are generally hypersurfaces which, when intersected over
time, create sets which may have highly irregular geometries and which need not be connected in the
parameter space (see e.g. Fig. 2 and [121]). The work that has been done on such problems has been
largely concerned with developing novel algorithms for MISO, real parameter, systems, which bound $\Omega(n)$.
Specific approaches can be found in [6], [8], [10], [13], [19], [34]-[36], [56], [57], [74], [84], [90], [93], [97], [104], [105],
[113], [123], [124], [128], [129]. Since the focus of this paper is upon a special class of LP methods and signal
processing applications, we shall not further pursue the topic of non-LP models⁷. An excellent place to
begin a review of non-LP methods is with the recent paper by Walter and Piet-Lahanier [121].

⁵ This is slightly less general than stating asymmetrical amplitude bounds, $\gamma_{\min}(n) < \| \varepsilon_*(n) \| < \gamma_{\max}(n)$, but the very
slight loss of generality is worth the significant analytic gain afforded by this assumption.

⁶ We shall assume this sequence known throughout this paper. While determination of appropriate error bounds often
follows naturally from the physical constraints of the problem, in other cases this determination is challenging. One theoretical
approach is found in [lid], while an experimental discussion for a particular application is found in [i'tj.

In the LP model case, error bounding implies pointwise "hyperstrip" regions of possible parameter sets
in the parameter space,

$$\omega(n) = \left\{ \Theta \;\middle|\; \| y(n) - \Theta^H x(n) \|^2 < \gamma(n) \right\}, \qquad (32)$$

which, when intersected over a given time range (see (31)), usually form convex polytopes of feasible
parameters (see Fig. 3). Three different approaches have been introduced which describe or characterize the
feasible parameter sets. The first approach (developed for real, generally MISO, systems) produces exact
parameterized descriptions of these polytopes [7], [16], [17], [73], [77], [78], [92], [122], [125]-[127]. Although this
approach is recursive and simple, its computational complexity increases with the number of vertices of the
polytope. The second approach (also for real MISO systems) gives orthotopic outer bounds of the solution
sets [3], [73], [77]. This approach yields exact parameter uncertainty intervals at the expense of very complex
computations. The third approach is of much lower complexity than the first two and works with an
outer bounding⁸ hyperellipsoid, a superset of the polytope [24], [26], [27], [29]-[32], [37], [38], [41], [49], [51], [84]-
[89], [99]-[101].
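The definition of the feasible set as an intersection of hyperstrips translates directly into a membership test. The sketch below assumes NumPy and row-wise storage of x(t) and y(t); the function name `in_feasible_set` and the synthetic bounded-noise data are hypothetical. It only tests a candidate parameter and does not construct the polytope itself.

```python
import numpy as np

def in_feasible_set(theta, x, y, gamma):
    """True if theta lies in every hyperstrip omega(t), t = 1..n, of (32).

    theta : (m, k) candidate, x : (n, m), y : (n, k), gamma : (n,) bounds.
    """
    resid = y - x @ theta.conj()                 # row t is (y(t) - Theta^H x(t))^T
    return bool(np.all(np.sum(np.abs(resid) ** 2, axis=1) < gamma))

# Assumed real SISO example with noise bounded well inside gamma
rng = np.random.default_rng(3)
n, m = 30, 2
theta_true = np.array([[0.5], [-0.3]])
x = rng.standard_normal((n, m))
noise = rng.uniform(-0.09, 0.09, size=(n, 1))    # noise^2 <= 0.0081 < gamma
y = x @ theta_true.conj() + noise
gamma = np.full(n, 0.01)

inside = in_feasible_set(theta_true, x, y, gamma)
outside = in_feasible_set(theta_true + 10.0, x, y, gamma)
```

The generating parameters always pass the test when the noise honors the bound, while a distant candidate fails as soon as one hyperstrip excludes it.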

Ellipsoid algorithms are often presented as BE procedures, and indeed they do follow from the BE
constraints.  However,  they  are  more  fruitfully  viewed  as  a  marriage  between  the  LSE  and  BE  problems 
for  LP  models.  With  this  point  of  view,  signal  processing  engineers  have  begun  to  exploit  the  benefits  of 
BE  information  in  the  context  of  LSE  identification  problems.  To  stress  this  point  of  view,  we  feature  the 
ellipsoid  algorithms  in  their  own  section  to  follow.  This  subject  will  be  covered  in  considerable  detail  and 
will  comprise  the  remainder  of  the  paper. 

6  Combining  the  LSE  and  BE  Problems:  Ellipsoid  Algorithms 

6.1  A  Unified  Optimal  Bounding  Ellipsoid  (UOBE)  Algorithm 

Please note that the rigorous development of several of the key results to follow is found in the appendices.

The benefits of combining BE considerations, when they are known, with LSE identification have
been alluded to in Section 4.2. LSE identifiers exploit no point-by-point information which can be used
to ascertain the usefulness of observations. This fact manifests itself in the effective retention of the
entire parameter space as a "feasible set," and results in wasteful processing. The idea to combine BE
considerations with LSE identification did not arise out of a quest to make LSE processing more efficient,
however. Rather, it resulted from the discovery that ellipsoid bounding algorithms are very closely related
to WRLS. While clearly the feasible set arising from any SM algorithm will contain the LSE estimate, it
is the ellipsoid algorithms which have a particularly attractive relationship.

⁷ With one exception. Methods developed for ARX (LP) models have been extended for use with ARMA and ARMAX
(non-LP) models [84], [99], [101]. We will discuss these techniques below.

⁸ Inner bounding algorithms of the last two approaches have also been presented in [85], [116].

We begin by seeking a solution to the SM (BE) problem. Since we are working with an LP model, the
BE constraint (see Problem 3) is given by

$$\| y(n) - \Theta_*^H x(n) \|^2 < \gamma(n). \qquad (33)$$

It follows readily that (see Lemma 1 in Appendix A.2)

$$\sum_{t=1}^{n} \lambda_n(t)\, \| y(t) - \Theta_*^H x(t) \|^2 < \sum_{t=1}^{n} \lambda_n(t)\, \gamma(t) \qquad (34)$$

for any positive numbers $\lambda_n(t)$, $t \in [1, n]$. For any nonnegative sequence $\lambda_n(\cdot)$, (34) specifies a set to which
$\Theta_*$ must belong. Let us denote this set

$$\bar{\Omega}(n) = \left\{ \Theta \;\middle|\; \sum_{t=1}^{n} \lambda_n(t)\, \| y(t) - \Theta^H x(t) \|^2 < \sum_{t=1}^{n} \lambda_n(t)\, \gamma(t) \right\}. \qquad (35)$$

Note that all elements of $\bar{\Omega}(n)$ need not be in the usual feasible set, $\Omega(n)$, consisting of the intersection of
pointwise "hyperstrips" (see (31) and (32)). In fact, $\bar{\Omega}(n)$ can be almost any size depending on the choice
of numbers $\lambda_n(t)$, $t \in [1, n]$. (Note that $\lambda_n(\cdot)$ is indexed (subscript) by the end-time n because we might
wish to have a completely different sequence of parameters at each n to control the size, placement, etc.
of the hyperellipsoid.) Whatever sequence $\lambda_n(\cdot)$ is chosen, however, the set $\bar{\Omega}(n)$ must contain the feasible
set, and therefore

$$\Theta_* \in \Omega(n) \subset \bar{\Omega}(n). \qquad (36)$$


The following development is rigorously supported by Proposition 1 in Appendix A.1. Some manipulation
of (35) shows that the set $\bar{\Omega}(n)$ may be expressed as follows:

$$\bar{\Omega}(n) = \left\{ \Theta \;\middle|\; \mathrm{tr}\left\{ [\Theta - \Theta_c(n)]^H\, \Psi(n)\, [\Theta - \Theta_c(n)] \right\} < 1 \right\} \qquad (37)$$

where tr{·} denotes the trace of a matrix. This set is a hyperellipsoid in the parameter space with its "center" at
$\Theta_c(n)$. We give meaning to the term "hyperellipsoid" below. The fundamental connection of this ellipsoidal
set to the weighted LSE problem is as follows: The center of the ellipsoid is exactly the weighted LSE
estimate using weights $\lambda_n(\cdot)$,

$$\Theta_c(n) = \Theta(n), \qquad (38)$$

and the ellipsoid matrix $\Psi(n)$ is a scaled version of the associated covariance matrix

$$\Psi(n) = \frac{C(n)}{\kappa(n)} \qquad (39)$$

(see (18) and (19)), and where $\kappa(n)$ is the scalar quantity,

$$\kappa(n) = \mathrm{tr}\left\{ \Theta^H(n)\, C(n)\, \Theta(n) \right\} + \sum_{t=1}^{n} \gamma(t)\, \lambda_n(t) \left[ 1 - \gamma^{-1}(t)\, \| y(t) \|^2 \right]. \qquad (40)$$

To give meaning to the term "hyperellipsoid," consider a single column, say $\theta_i(n)$, of $\Theta(n)$, corresponding
to output $y_i(\cdot)$ in vector $y(\cdot)$. Using (37), we see that $\theta_i(n)$ is constrained to be an element of a set which
is properly called a hyperellipsoid, say,

$$\bar{\Omega}_i(n) = \left\{ \theta_i \;\middle|\; [\theta_i - \theta_i(n)]^H\, \frac{C(n)}{\kappa(n)}\, [\theta_i - \theta_i(n)] < 1 \right\}. \qquad (41)$$

In particular, when $\theta_i(n)$ is real and of dimension two, the perimeter of $\bar{\Omega}_i(n)$ is precisely what is
conventionally regarded as an ellipse in $\mathbb{R}^2$ (see Fig. 3). Notice that all outputs $y_i(\cdot)$, i = 1, 2, ..., k, will
apparently share the same "ellipsoid matrix," $C(n)/\kappa(n)$, but their corresponding ellipsoids will be
centered on different estimates. This fact will be important in the optimization problem to be discussed
below. Also note that the influence of the "other" outputs in $y(\cdot)$ on the ellipsoid $\bar{\Omega}_i(n)$ arises through the
parameter $\kappa(n)$. This means that the MIMO solution, as we shall describe it here, is not equivalent to a
decomposition of the problem into k MISO systems. However, the MIMO problem does correctly include
the MISO problem as a special case.
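The equivalence between the weighted constraint (35) and the ellipsoidal form (37) can be checked numerically. The sketch below assumes NumPy, the conjugate-transpose reading of the superscripts, and hypothetical helper names; it compares the "slack" of each inequality, which should coincide for any candidate parameter matrix when the center, C(n), and kappa(n) are formed as in (38)-(40).

```python
import numpy as np

def ellipsoid_from_weights(x, y, gamma, lam):
    """Center, matrix C(n), and scalar kappa(n) of the bounding ellipsoid, per (38)-(40)."""
    C = (x.T * lam) @ x.conj()
    C_xy = (x.T * lam) @ y.conj()
    theta = np.linalg.solve(C, C_xy)                   # center = weighted LSE estimate
    y_energy = np.sum(np.abs(y) ** 2, axis=1)          # ||y(t)||^2
    kappa = np.real(np.trace(theta.conj().T @ C @ theta)) + np.sum(lam * (gamma - y_energy))
    return theta, C, kappa

def weighted_slack(theta, x, y, gamma, lam):
    """Right side minus left side of the weighted constraint in (35)."""
    resid = y - x @ theta.conj()
    return np.sum(lam * (gamma - np.sum(np.abs(resid) ** 2, axis=1)))

def ellipsoid_slack(theta, center, C, kappa):
    """kappa(n) minus the trace form of (37) scaled by kappa(n); positive inside."""
    d = theta - center
    return kappa - np.real(np.trace(d.conj().T @ C @ d))

# Assumed complex MIMO data with ||noise(t)||^2 <= 0.01 < gamma
rng = np.random.default_rng(4)
n, m, k = 25, 3, 2
theta_true = rng.standard_normal((m, k)) + 1j * rng.standard_normal((m, k))
x = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
noise = 0.05 * (rng.uniform(-1, 1, (n, k)) + 1j * rng.uniform(-1, 1, (n, k)))
y = x @ theta_true.conj() + noise
gamma = np.full(n, 0.02)
lam = rng.uniform(0.1, 1.0, n)

center, C, kappa = ellipsoid_from_weights(x, y, gamma, lam)
probe = center + 0.1 * rng.standard_normal((m, k))
slack_a = weighted_slack(probe, x, y, gamma, lam)
slack_b = ellipsoid_slack(probe, center, C, kappa)
true_slack = ellipsoid_slack(theta_true, center, C, kappa)
```

Agreement of the two slacks at an arbitrary probe point reflects the completion of the square that leads from (35) to (37), and a positive slack at the generating parameters illustrates (36).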

We conclude therefore that under known BE constraints, a hyperellipsoid can be associated with a
weighted LSE estimation problem and conversely. This set is illustrated in Fig. 3 for the two-dimensional
case. Clearly, the weights $\lambda_n(\cdot)$ parameterize the ellipsoid and presumably can serve to control its size
and orientation in the parameter space. Anticipating that we will want to work with recursive least
squares estimation, let us henceforth restrict our attention to weight sequences which conform to the
scaling pattern⁹ (22). This effectively restricts to one (viz. $\lambda_n(n)$) the number of free parameters available
to control the bounding ellipsoid at time n. The central objective of a bounding ellipsoid algorithm is to
employ the weights in the context of LSE estimation to sequentially optimize some feature of the ellipsoid
(directly or indirectly related to its "size"). A significant benefit is that often no positive weight exists which can
shrink the ellipsoid, indicating that the incoming data set is uninformative in the SM sense.

While it may not be immediately apparent from the original developments in the literature, all published
bounding ellipsoid algorithms, both adaptive and nonadaptive, adhere to the following steps. Let us refer
to this set of operations as the Unified Optimal Bounding Ellipsoid (UOBE) algorithm (for a complex
MIMO LP system): At time n,

1. In conjunction with the incoming data set (y(n), x(n)), find the weight, say $\lambda_n^*(n)$, which will
prospectively optimize some quantitative feature of $\bar{\Omega}(n)$ related to its "size." (This will require knowledge
of C(n - 1), $\kappa(n-1)$, and $\zeta(n-1)$.)

2. Discard the data set (y(n), x(n)) if $\lambda_n^*(n) \le 0$.

3. Update C(n) and $\Theta(n)$ using MIL-WRLS or QR-WRLS (see Section 4.2).

4. Update $\kappa(n)$ according to (40) or one of the recursions given in Lemma 2 in Appendix A.2.

⁹ An exception to this rule is that, for adaptive strategies to be discussed below, we will additionally allow $\lambda_n(t)$ to be set
to zero for one or more $t \le n - 1$.
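The four steps above can be sketched for a simplified setting. The following is not the algorithm of any cited paper: it assumes a real-valued SISO model, no scaling ($\zeta \equiv 1$), an initial ball $\delta \|\theta\|^2 < 1$ that contains the true parameters, and a crude grid search over the weight in place of the closed-form optimality test. The recursion used for kappa(n) is the standard identity for the weighted residual energy under these assumptions, since the source's own recursions (Lemma 2) are not reproduced in this section. The function name `uobe_volume_sketch`, the grid, and the data are hypothetical.

```python
import numpy as np

def uobe_volume_sketch(x, y, gamma, delta=1e-2, lam_grid=np.linspace(0.0, 5.0, 501)):
    """Real-valued SISO sketch of steps 1-4 with zeta = 1 and a grid search for the weight.

    x : (n, m) regressors, y : (n,) outputs, gamma : (n,) error bounds.
    Starts from C = delta * I, theta = 0, kappa = 1 (the ball delta * ||theta||^2 < 1).
    Returns theta, C^{-1}, kappa, accepted time indices, and per-step volume-measure ratios.
    """
    m = x.shape[1]
    C_inv, theta, kappa = np.eye(m) / delta, np.zeros(m), 1.0
    accepted, ratios = [], []
    for n in range(x.shape[0]):
        v = C_inv @ x[n]
        G = x[n] @ v                                  # G(n) = x^T C^{-1}(n-1) x
        eps = y[n] - theta @ x[n]                     # a priori prediction error
        # Step 1: candidate kappa(n) and ratio det(Psi^{-1}(n)) / det(Psi^{-1}(n-1)) on the grid
        kap = kappa + lam_grid * gamma[n] - lam_grid * eps**2 / (1.0 + lam_grid * G)
        ratio = (kap / kappa) ** m / (1.0 + lam_grid * G)
        j = int(np.argmin(ratio))
        lam = lam_grid[j]
        ratios.append(ratio[j])
        if lam <= 0.0:                                # Step 2: discard uninformative data
            continue
        # Step 3: MIL-WRLS update of C^{-1} and theta
        C_inv = C_inv - lam * np.outer(v, v) / (1.0 + lam * G)
        theta = theta + lam * (C_inv @ x[n]) * eps
        kappa = kap[j]                                # Step 4: update kappa
        accepted.append(n)
    return theta, C_inv, kappa, accepted, ratios

# Assumed data: noise^2 <= 0.0081 < gamma, and delta * ||theta_true||^2 < 1
rng = np.random.default_rng(5)
n, m = 200, 2
theta_true = np.array([0.5, -0.3])
x = rng.standard_normal((n, m))
y = x @ theta_true + rng.uniform(-0.09, 0.09, size=n)
gamma = np.full(n, 0.01)

theta, C_inv, kappa, accepted, ratios = uobe_volume_sketch(x, y, gamma)
d = theta_true - theta
inside = d @ np.linalg.inv(C_inv) @ d < kappa
```

Because a zero weight is always among the candidates, the volume measure can never increase from one step to the next, and the generating parameters remain inside the ellipsoid whichever nonnegative weights are selected. The grid search is a convenience for illustration; the closed-form tests discussed in the text avoid it.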

With one exception (see Dasgupta-Huang OBE algorithm below), all published OBE algorithms operate
on the principle of minimizing a set measure on $\bar{\Omega}(n)$ by choice of $\lambda_n^*(n)$. For a SISO system, Fogel and
Huang suggest two set measures for the optimization. The first is the determinant of the matrix $\Psi^{-1}(n)$,

$$\mu_v\{\bar{\Omega}(n)\} = \det\{\Psi^{-1}(n)\} \qquad (42)$$

and the second is the trace,

$$\mu_t\{\bar{\Omega}(n)\} = \mathrm{tr}\{\Psi^{-1}(n)\}. \qquad (43)$$

(Having established these quantities as set measures on $\bar{\Omega}(n)$, for simplicity, we shall henceforth write
$\mu_v(n)$ and $\mu_t(n)$. We shall also occasionally write $\mu(n)$ to mean "either $\mu_v(n)$ or $\mu_t(n)$.") In the MISO
case in which $\bar{\Omega}(n)$ is clearly interpretable as an ellipsoid (see (41)), $\mu_v(n)$ is proportional to the square of
the volume of the ellipsoid, while $\mu_t(n)$ is proportional to the sum of the squares of its semi-axes. A
moment's reflection will indicate that the same two measures are meaningful in the MIMO case, since they
result in the minimization of the volume or trace of the common ellipsoid shared by all the outputs (see
discussion below (41)).
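The geometric meaning of the two measures can be illustrated numerically. The sketch assumes NumPy and a Hermitian positive definite C(n); the function names and the small example matrix are hypothetical. The semi-axis lengths of the ellipsoid are the square roots of the eigenvalues of $\Psi^{-1}(n) = \kappa(n) C^{-1}(n)$.

```python
import numpy as np

def set_measures(C, kappa):
    """Volume and trace measures (42), (43) of the ellipsoid with matrix Psi = C / kappa."""
    psi_inv = kappa * np.linalg.inv(C)
    return np.real(np.linalg.det(psi_inv)), np.real(np.trace(psi_inv))

def semi_axes(C, kappa):
    """Semi-axis lengths: square roots of the eigenvalues of Psi^{-1}."""
    return np.sqrt(kappa / np.linalg.eigvalsh(C))

# Assumed two-dimensional real example
C = np.array([[4.0, 1.0], [1.0, 3.0]])
kappa = 2.0
mu_v, mu_t = set_measures(C, kappa)
a = semi_axes(C, kappa)
```

In this two-dimensional case the enclosed area is pi times the product of the semi-axes, so the determinant measure equals the squared area divided by pi^2, while the trace measure is the sum of the squared semi-axes.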

It is shown in Proposition 2 in Appendix A.1 that, when the scaling sequence is causal (see discussion
below (25)), then $\lambda_n^*(n)$ is the unique positive root of the polynomials $F_v(\lambda)$ and $F_t(\lambda)$ for the volume and
trace measures respectively, where $F_v$ is a quadratic,

$$F_v(\lambda) = a_2 \lambda^2 + a_1 \lambda + a_0, \qquad (44)$$

and $F_t$ is a cubic polynomial

$$F_t(\lambda) = b_3 \lambda^3 + b_2 \lambda^2 + b_1 \lambda + b_0. \qquad (45)$$

The coefficients $a_i$ and $b_i$ are given in terms of quantities which are known prior to time n.

Interestingly, we will find that the optimization procedure is not "locally" affected by a causal scaling
process. This is so because neither measure $\mu_v$ nor $\mu_t$ is changed when the scale factor is included. To
show precisely what we mean by this, consider the optimization problem at time n. All previous weights
will be modified by scale factor $\zeta(n-1)$. We have called the resulting covariance matrix $C_s(n-1) =
C(n-1)/\zeta(n-1)$. The definition of $\kappa(n-1)$ in (40) will indicate that the effect of weight scaling on this
quantity is likewise a simple scaling,

$$\kappa_s(n-1) = \frac{\kappa(n-1)}{\zeta(n-1)}. \qquad (46)$$

Accordingly, the ellipsoid matrix $C(n-1)/\kappa(n-1)$ is changed to $C_s(n-1)/\kappa_s(n-1)$ by the scaling
procedure. Note, however, that the scale factors cancel in this ratio, so that either of the measures of size
will remain unchanged. It is very important not to infer that the optimization process is independent of
the scaling factors. Clearly the existing covariance matrix and $\kappa$ value at each time are influenced by the
complete history of the scale factors. The consequence of the analysis above is simply that the ellipsoid
volume at a specific time is not affected by scaling. This will have implications for theoretical developments
surrounding the optimal weight (see Proposition 2 in Appendix A.1).

Finally,  we  note  an  important  fact  to  which  we  will  return  in  Section  6.6.  For  the  volume  algorithm 
using weights of form (22), it can be shown that, if an optimal weight exists, it will definitely shrink the
volume  of  the  ellipsoid.  A  similar  result  can  be  obtained  for  the  trace  measure  [80].  This  has  important 
implications  for  convergence  of  the  ellipsoid,  the  analysis  of  which  has  been  widely  misunderstood. 

A detailed version of the UOBE algorithm for a MISO system, which is based on QR-WRLS and the
volume  criterion,  appears  in  Fig.  4.  It  should  be  clear  how  to  incorporate  changes  necessary  to  implement 
a  "trace”  algorithm,  or  to  introduce  additional  outputs.  This  general  algorithm  will  embrace  any  of  the 
specific  algorithms  discussed  below.  We  now  discuss  variations  on  this  general  algorithm. 

6.2  The  Fogel-Huang  OBE  Algorithm 
6.2.1  History  and  Development  of  F-H  OBE. 

The  first  major  journal  paper  on  the  application  of  ellipsoid  algorithms  to  parametric  LP  models  was 
published  by  Fogel  and  Huang  in  1982  [41].  The  Fogel-Huang  algorithm  is  frequently  called  the  optimal 
bounding ellipsoid (OBE) algorithm, and we shall adopt the name "F-H OBE" in this paper to distinguish
it from another algorithm to be presented below. F-H OBE was originally presented for the SISO ARX
model,  but  it  is  easily  generalized  using  the  developments  described  in  this  paper.  This  method  follows 
the  basic  framework  of  the  UOBE  algorithm  enumerated  above,  with  the  following  specific  conditions: 

1. $\zeta(n) = \kappa(n)$ for each n;

2. MIL-WRLS is used to implement the recursions (but QR-WRLS can be used as well);

3. Set measures (42) and (43) are employed.

It is interesting to note that, because the scaling sequence $\zeta(\cdot)$ is equivalent to the sequence $\kappa(\cdot)$ in F-H
OBE, the ellipsoid matrix at time n, $\Psi(n)$, is identical to the scaled covariance matrix $C_s(n) = C(n)/\zeta(n)$
whose inverse is computed directly in the course of the MIL-WRLS equations. This is a consequence of
the geometric approach taken (see below) rather than a deliberate choice of the scaling sequence. We also
note that there is nothing to preclude the use of QR-WRLS in conjunction with F-H OBE. Alternative
versions of OBE have been published and will be described below, but first it is interesting to place the
F-H OBE development in historical perspective.

In [41], using the BE constraints only, Fogel and Huang arrive at the ellipsoid of form (37) and
recognize that the center of the ellipsoid is a weighted LSE estimate. However, the LSE problem is not pursued
directly. Instead, the fact that ellipsoids can be used to bound the feasible set is used as a motivation for
the following geometric approach: Assume that a membership set $\bar{\Omega}(n-1)$ is known at time n - 1. We
need not be aware of parameters $\lambda_{n-1}(t)$, $t \in [1, n-1]$, nor even that there is a LSE problem underlying
the membership set. The objective is to find a new (small, if possible) ellipsoid which circumscribes the
intersection of $\bar{\Omega}(n-1)$ with the incoming feasible "hyperstrip" $\omega(n)$ (see (32) and Fig. 5). The work of
Kahan [53] had shown that a family of such circumscribing ellipsoids could be computed using relations
which the authors then manipulate into the equations which comprise F-H OBE. The quantity q(n)
(equivalent to $\lambda_n(n)$) emerges as a single parameter with which to control the size of the ellipsoid $\bar{\Omega}(n)$. We will
henceforth refer to q(n) as $\lambda_n(n)$, even though this does not connote the geometric spirit of the original
F-H OBE development.

F-H OBE is sometimes called the minimum volume sequential (MVS) algorithm when it is based
upon sequential minimization of $\mu_v(n)$. This involves the construction of, and solution for the roots of, the
quadratic equation (44) to find the optimal parameter $\lambda_n^*(n)$. Similarly, the minimum trace sequential
(MTS) algorithm is based upon minimization of $\mu_t(n)$ by optimizing $\lambda_n(n)$. This procedure requires the
construction and solution for the positive root of the cubic equation (45).

The F-H OBE algorithm was the first UOBE-type algorithm to be presented as having potential
benefits for signal processing [49]. These benefits derive from the optimization procedure, as alluded to
above. Generally speaking (precise comments are found in Section 6.6 below), as n increases, the true
feasible set $\Omega(n)$ and the ellipsoid $\bar{\Omega}(n)$ decrease in size and it becomes increasingly likely that¹⁰

$$\bar{\Omega}(n) = \omega(n) \cap \bar{\Omega}(n-1) = \bar{\Omega}(n-1). \qquad (48)$$

This means that the new data set is not providing any useful information in the sense of shrinking the
membership set. There is no positive parameter $\lambda_n(n)$ with which to combine the data set at time n
with the current ellipsoid to create a smaller ellipsoid. The manifestation, therefore, is that the "optimal"
parameter in the sense of minimizing $\mu(n)$, $\lambda_n^*(n)$, is nonpositive. In this case the data set at time n should
be rejected and the computational effort of processing it avoided. In many simulations and experiments
with real data (e.g. [24],[29],[32],[49]), typically 70% of the data are "rejected" in this sense.

¹⁰ It also becomes increasingly likely that

$$\Omega(n) = \omega(n) \cap \Omega(n-1) = \Omega(n-1), \qquad (47)$$

but this does not necessarily mean that the incoming data set cannot be used to minimize the ellipsoid.


A critical point about this data selection process must be made which was not necessarily evident in the
early papers. One must be very careful not to infer that the complexity of the F-H OBE (or any UOBE)
algorithm is drastically reduced (say, by 90%) by virtue of this data selection procedure. In fact, in its
basic form, if the parameter matrix $\Theta_*$ is of dimension m x k, then each of the optimality checks requires
$O(m^2)$ complex floating point operations (cflops); then, when a data set is accepted, another $O[(2 + .bk)m^2]$ cflops are
required to update the covariance matrix and parameter estimates¹¹. To process the data set directly
using WRLS requires $O(3m^2)$ cflops. While a dramatic decrease in the number of data used results, the
computational load is not significantly decreased, especially for large m. There are methods to remedy
this problem which will be discussed below.

As an aside, we note that the MVS version of the F-H OBE algorithm is suboptimal in the following
sense. When one of the hyperplanes bounding $\omega(n)$ does not intersect $\bar{\Omega}(n-1)$, a smaller (volume) $\bar{\Omega}(n)$
can be achieved by repositioning the nonintersecting plane to be tangent to $\bar{\Omega}(n-1)$. Belforte and Bona
have suggested this procedure in [5] (see also [9]). As pointed out by Walter and Piet-Lahanier [121],
the modified procedure is equivalent to the ellipsoid with parallel cuts (EPC) algorithm developed by
researchers working in linear programming [59],[103].

6.2.2  Dasgupta-Huang  OBE 

A significant variation on the F-H OBE algorithm has been suggested by Dasgupta and Huang [24] since
the publication of the original algorithm. Again, the method was originally developed for the SISO ARX
model, but can be generalized using the developments in this paper. The Dasgupta-Huang OBE (D-H
OBE) algorithm has two unusual features with respect to all other UOBE algorithms. These are the use
of noncausal scale factors, and an optimization procedure which does not seek to directly minimize a set
measure on $\bar{\Omega}(n)$. D-H OBE employs the scale factors

$$\zeta(n-1) = \left( 1 - \lambda_n^*(n) \right)^{-1}. \qquad (49)$$

With reference to (19) and (22), it is seen that, for a given optimal weight $\lambda_n^*(n)$, the updated covariance
matrix is a convex combination of C(n - 1) and the new data outer product,

$$C(n) = \left( 1 - \lambda_n^*(n) \right) C(n-1) + \lambda_n^*(n)\, x(n)\, x^H(n). \qquad (50)$$

Accordingly, this choice of scale factors constrains the optimal weights to the range $[0, \alpha]$ for $\alpha < 1$. The
central benefit of this method is that it provides the means with which to prove asymptotic and exponential
convergence of the ellipsoid, and cessation of updating, using Lyapunov theory. Upon convergence, the
residuals, $\varepsilon(\cdot, \Theta(\cdot))$, are guaranteed to remain in the "dead zone" indicated by the error bounds, i.e.,
$\lim_{t \to \infty} \| \varepsilon(t, \Theta(t)) \|^2 < \gamma(t)$. The number $(1 - \lambda_n^*(n))$ is referred to as a "forgetting factor" by Dasgupta
and Huang, and, although it does serve to downweight the past contributions to the covariance matrix,
it is not a forgetting factor in the conventional sense, since it is not a free parameter and therefore does not
explicitly control adaptation. The algorithm does exhibit some adaptation capabilities, as do other UOBE
algorithms, due to the optimal data weighting. Explicitly adaptive UOBE algorithms will be discussed
below.

¹¹ Throughout, a cflop is taken to be one complex multiplication and one complex addition.
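The convex-combination update (50) and its consistency with the MIL recursion (24) can be checked with a short sketch. It assumes NumPy and a weight strictly below one; the function names `dh_covariance_update` and `dh_inverse_update` and the test data are hypothetical, and the selection of the weight itself is not shown.

```python
import numpy as np

def dh_covariance_update(C_prev, x_n, lam_star):
    """C(n) = (1 - lam) C(n-1) + lam x x^H, i.e. (22) with zeta(n-1) = 1/(1 - lam)."""
    zeta_prev = 1.0 / (1.0 - lam_star)            # scale factor (49)
    C_s = C_prev / zeta_prev                      # scaled covariance C_s(n-1)
    return C_s + lam_star * np.outer(x_n, x_n.conj())

def dh_inverse_update(C_inv_prev, x_n, lam_star):
    """Same update carried out on the inverse via the MIL recursion (24)."""
    Cs_inv = C_inv_prev / (1.0 - lam_star)        # zeta(n-1) C^{-1}(n-1)
    v = Cs_inv @ x_n
    G_s = np.real(np.vdot(x_n, v))
    return Cs_inv - lam_star * np.outer(v, v.conj()) / (1.0 + lam_star * G_s)

# Assumed Hermitian positive definite starting matrix and complex regressor
rng = np.random.default_rng(6)
m, lam_star = 3, 0.3
A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
C_prev = A @ A.conj().T + np.eye(m)
x_n = rng.standard_normal(m) + 1j * rng.standard_normal(m)

C_new = dh_covariance_update(C_prev, x_n, lam_star)
C_inv_new = dh_inverse_update(np.linalg.inv(C_prev), x_n, lam_star)
C_same = dh_covariance_update(C_prev, x_n, 0.0)
```

A zero weight leaves the covariance matrix untouched, which is consistent with discarding an uninformative data set without rescaling the past.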

A second significant difference in the D-H OBE algorithm occurs in the technique employed for
determining "optimal" weights $\lambda_n^*(n)$. Rather than minimize a set measure by means of a polynomial such as
(44) or (45), the weight is chosen to minimize $\kappa(n)$, subject to the constraint that it be in the allowable
range $[0, \alpha]$. The reason for this choice is that $\kappa(n)$ is a bound on the Lyapunov function used in the
minimization. A side benefit is that the check for usefulness of the data set is very cost effective. We will
return to a discussion of this "unconventional" optimization technique, as well as issues of computational
efficiency, in two places below.

It is notable that, in spite of the "noncausal" scaling factor at time n - 1, $\zeta(n-1)$, which might be
expected to create intractable nonlinearities with respect to $\lambda_n^*(n)$, it is still possible to derive polynomials
like (44) and (45) with which to optimize set measures of $\bar{\Omega}(n)$ [68]. Doing so, however, defeats one of the
main purposes of using the complicated scale factors, and whether such an optimization has any usefulness
remains an open question. We return to this issue in Section 6.b.2.

6.3  The  SM-WRLS  Algorithm 

6.3.1  History  and  Development  of  SM-WRLS. 

While developed geometrically, we know that the F-H OBE algorithm solves a LSE problem with time
varying weights. From this point of view, it is interesting to note that the algorithm is charged with focusing
on the hyperstrip $\omega(n)$ associated with the "new" data set. Intuitively, the scaling down of previous weights
is consistent with this concentration on the new data set. However, it is prudent to wonder whether a
tighter, or at least "simpler" membership set could be found. The SM-WRLS algorithm, to which we now
turn, addresses both the concern for a more conventional algorithm and the more "uniform" attention to
the true feasible set.

Even though Fogel and Huang clearly state in their 1982 paper that there is an LSE problem underlying
F-H OBE, the geometric approach taken tends to obscure its presence. This approach notwithstanding,
the similarity of the F-H OBE (as well as the D-H OBE) equations to WRLS is striking, and
it has not gone unnoticed in the literature. In their recent paper, Walter and Piet-Lahanier make the
following remarks [121]: "Let us stress, however, that the EPC and MVS algorithms are not just another
variation of RLS. As Schweppe puts it [102], a comparison of set theoretic concepts with stochastic theory
reveals that

1. the detailed mathematical manipulations are very different.

2. the final equations look similar.

3. the final equations behave quite differently in general.

"Moreover, the type of information needed is completely different. The RLS algorithm only requires
measurements, whereas the EPC and MVS algorithms also require bounds on the errors." As we know
from above, however, the difference between WRLS and F-H OBE (or any UOBE algorithm) is not as
great as one may infer from these comments. As Norton and Mo have recently written concerning
F-H OBE [86], "The algorithm [F-H OBE] differs from recursive least squares by an extra data-dependent
scaling of [the ellipsoid matrix]." In a 1989 paper, Deller [26] similarly recognized that "[F-H] OBE
is 'WRLS with time varying weights'." It is this recognition, combined with Norton and Mo's formulation
for adaptive ellipsoid processing, that led to the UOBE formulation taken in this paper. (We will see that
UOBE also embraces adaptive strategies below.)

Until recently, however, this uniformity of ellipsoid algorithms was not fully appreciated. In the early
and mid 1980's, Deller and students [30],[37],[38],[70], recognizing the similarity of F-H OBE to RLS,
attempted to associate an ellipsoid directly with WRLS rather than conversely. The result is an OBE-like
algorithm which is exactly interpretable as conventional WRLS (i.e., only equations (24) and (25)
with $C_s(n) = C(n)$ or $\zeta(n) = 1$, $\forall n$), with the sequence of optimal ellipsoid parameters
$\lambda^*(\cdot)$ simply interpretable as the weights used in the process. Fogel and Huang's volume measure
$\mu_v(n)$ has been used as the optimization criterion, but the trace measure can be employed as well. In
later work, the use of QR-WRLS was suggested to enhance the method in a number of ways to be described [26].

The algorithm proposed by Deller and others has been called set membership weighted recursive least
squares (SM-WRLS) to emphasize the nature of their approach. SM-WRLS is, in fact, a UOBE algorithm
with the following conditions:

1. $\zeta(n) = 1$ for all $n$;

2. QR-WRLS is used to implement the recursions (but MIL-WRLS can also be used);

3. Volume measure $\mu_v(n)$ is used as the optimization criterion (but $\mu_t(n)$ can be used as well).
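
As a rough illustration of the statement that SM-WRLS is ordinary WRLS with an ellipsoid attached, the following Python sketch carries an estimate, a covariance matrix, and $\kappa$ through a matrix-inversion-lemma WRLS recursion. It is a sketch under assumptions rather than the algorithm of the cited papers: the $\kappa$ recursion is the standard least-squares residual identity, not a transcription of (24), (25), or (40); the weight `lam` is supplied by the caller because the optimization polynomials (44) and (45) are not reproduced in this section; and all function names, the initial values `delta` and `kappa = 1`, and the time-invariant test system are hypothetical.

```python
import random


def wrls_step(theta, P, C, kappa, x, y, gamma, lam):
    """One weighted RLS step with weight lam (lam = 0 rejects the data set).

    theta: current estimate, P: inverse covariance-like matrix C^{-1},
    C: weighted covariance matrix, kappa: ellipsoid scale,
    gamma: assumed bound on the squared error at this time.
    """
    if lam == 0.0:
        return theta, P, C, kappa
    m = len(theta)
    Px = [sum(P[i][j] * x[j] for j in range(m)) for i in range(m)]
    G = sum(x[i] * Px[i] for i in range(m))            # x^T C^{-1}(n-1) x
    eps = y - sum(theta[i] * x[i] for i in range(m))   # a priori prediction error
    d = 1.0 + lam * G
    theta = [theta[i] + lam * Px[i] * eps / d for i in range(m)]
    P = [[P[i][j] - lam * Px[i] * Px[j] / d for j in range(m)] for i in range(m)]
    C = [[C[i][j] + lam * x[i] * x[j] for j in range(m)] for i in range(m)]
    # Assumed recursion: kappa grows by the weighted bound and shrinks by the
    # increase in the minimum weighted squared error.
    kappa = kappa + lam * gamma - lam * eps * eps / d
    return theta, P, C, kappa


def in_ellipsoid(theta_true, theta, C, kappa, tol=1e-9):
    """Check (theta_true - theta)^T C (theta_true - theta) <= kappa."""
    e = [a - b for a, b in zip(theta_true, theta)]
    m = len(e)
    q = sum(e[i] * C[i][j] * e[j] for i in range(m) for j in range(m))
    return q <= kappa + tol


def demo(n=500, seed=1):
    """Time-invariant AR(2) test with unit weights (hypothetical setup)."""
    rng = random.Random(seed)
    theta_true = [1.6, -0.68]
    delta = 1e-3                       # small initial covariance
    theta = [0.0, 0.0]
    P = [[1.0 / delta, 0.0], [0.0, 1.0 / delta]]
    C = [[delta, 0.0], [0.0, delta]]
    kappa = 1.0                        # initial ellipsoid contains theta_true
    y1 = y2 = 0.0
    ok = True
    for _ in range(n):
        x = [y1, y2]
        y = theta_true[0] * y1 + theta_true[1] * y2 + rng.uniform(-0.5, 0.5)
        theta, P, C, kappa = wrls_step(theta, P, C, kappa, x, y, 0.25, 1.0)
        ok = ok and kappa > 0 and in_ellipsoid(theta_true, theta, C, kappa)
        y1, y2 = y, y1
    return theta, kappa, ok
```

With a time-invariant system and a valid error bound (here `gamma = 0.25` for noise on $[-0.5, 0.5]$), the true parameters remain inside the ellipsoid at every step, which is the property that the time varying examples of this section go on to violate.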

6.3.2  Illustration 

At appropriate points in the paper, we will illustrate the behavior and performance of the UOBE
approach through simulation studies. A common set of two systems, introduced here, will be used.
For simplicity, the SM-WRLS algorithm is used as the nominal algorithm. The volume measure
is employed as the optimization criterion. Many other example studies are found in the literature (e.g.
[9],[24],[26],[27],[41],[49],[86],[89],[101]), including some with real data (e.g. [29],[77]). In particular, many
studies with time-invariant systems have been published, so we advance immediately to the case of time
varying parameters.

We consider the estimation of the parameters of a real signal, real parameter, time varying AR(2)
model of the form†

$$y(t) = a_{1*}(t)\,y(t-1) + a_{2*}(t)\,y(t-2) + \varepsilon_*(t) \qquad (51)$$

with $\theta_*(t) = [a_{1*}(t)\ \ a_{2*}(t)]^T$. Two similar systems will be used. The first is a more severe test of the
tracking ability of an identification algorithm. In this case the pole pair of the system alternates abruptly
between $0.8 \pm j0.2$ and $-0.8 \pm j0.2$ every 1000 samples, so that $a_{1*}$ alternates between $\pm 1.6$, and $a_{2*}$
remains constant at $-0.68$. In the second system, the poles alternate between the same sets of conjugate
pairs, but the transitions are gradual rather than abrupt. In this case $a_{1*}$ changes linearly (from $+1.6$
to $-1.6$ and vice versa) over 1000 point ranges, then remains fixed for 1000 point intervals. We shall
refer to the two systems as the "fast" and "slow" systems, respectively, though we hasten to point out that
the "slowly" time varying system does not represent a trivial tracking problem. Since only $a_{1*}$ changes in
each case, it is the more interesting parameter to observe. To conserve space, we show only the results for
$a_{1*}$ in each simulation. We found nothing particularly unusual or unexpected in the results for $a_{2*}$.

A 7000 point sequence, $y(\cdot)$, was generated by driving the parameter sets with an uncorrelated sequence,
$\varepsilon_*(\cdot)$, which was uniformly distributed on $[-0.5, 0.5]$. $\varepsilon_*(\cdot)$ was generated using a random number
generator based on a subtractive method [94].
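
The two test signals can be approximated with a few lines of Python. The sketch below is only indicative of the setup described above: it assumes that each system starts with $a_{1*} = +1.6$, that the slow system holds for 1000 points before its first ramp, that the initial conditions are zero, and that $a_{2*} = -|p|^2 = -0.68$ under the sign convention of (51) for poles $p = \pm 0.8 \pm j0.2$. Python's built-in generator is used in place of the subtractive generator of [94], so the sample values will not match those of the original experiments. All function names are hypothetical.

```python
import random


def a1_fast(t, period=1000, amp=1.6):
    """Abrupt case: a1 jumps between +amp and -amp every `period` samples."""
    return amp if (t // period) % 2 == 0 else -amp


def a1_slow(t, period=1000, amp=1.6):
    """Gradual case: hold +amp, ramp down, hold -amp, ramp up (assumed order)."""
    seg, k = divmod(t % (4 * period), period)
    ramp = amp * (1.0 - 2.0 * k / period)   # goes from +amp to -amp over one segment
    return (amp, ramp, -amp, -ramp)[seg]


def simulate(a1_of_t, n=7000, a2=-0.68, seed=0):
    """Generate y(t) = a1(t) y(t-1) + a2 y(t-2) + e(t), e uniform on [-0.5, 0.5]."""
    rng = random.Random(seed)
    y = []
    y1 = y2 = 0.0                           # zero initial conditions (assumed)
    for t in range(n):
        yt = a1_of_t(t) * y1 + a2 * y2 + rng.uniform(-0.5, 0.5)
        y.append(yt)
        y1, y2 = yt, y1
    return y


y_fast = simulate(a1_fast)
y_slow = simulate(a1_slow)
```

The squared-error bound needed by a set-membership algorithm for these signals would be 0.25, since the driving sequence never exceeds 0.5 in magnitude.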

The inherent ability of UOBE algorithms (without any special adaptive provisions) to adapt and track
time varying parameters is sometimes quite dramatic. In this work, we have intentionally chosen systems for
which SM-WRLS exhibits less than excellent tracking performance in order to illustrate several important
points. For reference, in Figs. 6(a) and (b), we show the results of using standard RLS in the identification
(no data selection and optimization, $\lambda^*(n) = \zeta(n) = 1$ for all $n$). The algorithm is clearly incapable of
following the parameters in either the fast or slow case. The RLS results can be contrasted with those using
SM-WRLS in Fig. 7. Though not excellent, the SM-WRLS results are improved with respect to RLS (at
least initially), and it is important to note that this improved performance comes with somewhat improved
computational efficiency (more on this below). In this case SM-WRLS uses only the fractions $p = 0.020$
(fast system) and $p = 0.025$ (slow) of the data and yet yields better parameter estimates in the early
stages of identification. However, two important points are to be emphasized here. First, SM-WRLS does
not reliably and predictably adapt to time varying systems. Even for more slowly time varying systems,
SM-WRLS (and UOBE algorithms in general) cannot be used in adaptive schemes with confidence. This
motivates the need for specifically adaptive techniques. Further, a deeper analysis of this situation reveals
a very important second point. The quantities $\mu_v(n)$ and the sign of $\kappa(n)$ ($\mathrm{sgn}\{\kappa(n)\}$) are shown in Figs.
8(a) and 8(b), respectively. At time $n = 1000$ we see a very disturbing development. The volume begins
to increase, and the parameter $\kappa$ becomes negative. Both of these trends are in violation of theory, but
they arise precisely because the theoretical development does not strictly support the identification of time
varying systems. The most revealing anomaly is the appearance of a negative $\kappa$, which indicates an "ellipsoid
of negative dimensions." Theoretically speaking, the algorithm has become completely disintegrated at
time $n = 1000$, and its performance is therefore not predictable based on SM principles. Nevertheless, we
see that the parameters continue to be tracked rather well for at least another cycle. Apparently, there
is significant benefit to using the "optimization" process even if the success is not analyzable. In fact, we
have seen this phenomenon in many other simulations. It is more likely to occur with rapid changes in
dynamics (as we discuss below), but may occur in slower systems as well. The conclusion is that, not only
is the apparent adaptive capability of UOBE algorithms unpredictable, but even when good tracking does
occur, the good performance is not necessarily attributable to the proper principles underlying the
methods. In turn, this latter observation adds to the uncertainty in predicting adaptive performance. We
shall return to these points in future discussions.

†Note that for the first time in this paper, we have allowed the dynamics of the "true" system to be time varying. The
theoretical developments above do not strictly support the identification of such systems, so the issue of adaptation is an
important one to which we shall pay close attention in the following.

6.4  Adaptive  UOBE  Algorithms 
6.4.1  Introduction 

While UOBE algorithms have been observed to have inherent and fortuitous adaptive capabilities as a result
of their optimal weighting strategies, we have just seen that these capabilities are unpredictable at best.
Accordingly, measures have been suggested by Norton and Mo [86], and Deller and Odeh
[26],[27],[31],[82],[87]-[89] to render adaptation explicit and controllable†. Of three general methods
suggested by Norton and Mo, the bound incrementing method does not closely follow the UOBE paradigm
established above, so we refer the reader to the original paper for details. The other two Norton methods
are discussed below. All adaptive strategies for ellipsoid algorithms work on the general principle of
iteratively inflating the "current" ellipsoid in some sense before considering an incoming data set. The
purpose of this inflation is to contain the shifting true parameters while at the same time increasing some
measure of "size" of the ellipsoid (see (42) and (43)), making it more likely that the incoming data, with
potentially novel information, will be selected. Deller and Odeh have suggested the use of QR-WRLS in the
adaptive methods because of the convenient computational interpretation of the procedure. In principle,
however, MIL-WRLS can be used as well.

†Additionally, Norton and Mo briefly discuss adaptive strategies for other than ellipsoid algorithms.


6.4.2  Exponential  Forgetting 

A general UOBE algorithm can be designed to be explicitly adaptive within the established framework
by judicious choice of the scaling sequence $\zeta(\cdot)$. One seemingly reasonable choice is to let the scaling
effect a conventional forgetting factor,

$$\zeta^{-1}(n) = \alpha, \qquad 0 < \alpha < 1 \qquad (52)$$

for all $n$. The computational mechanisms for including such a forgetting factor into both forms of WRLS
are found in Section 4.2. Additionally, if another scaling sequence $\zeta(\cdot)$ is part of the algorithm for some
purpose other than adaptation (e.g. in F-H OBE), then the sequence $\zeta(\cdot)/\alpha$ can be used for scaling in
order to achieve forgetting. Deller and Odeh have called this method exponential forgetting, while Norton and
Mo call it scalar bound inflation because of its equivalence to progressively "loosening" the past $\gamma$ bounds
(see (29)) as time goes on. Norton and Mo also point out that once any past optimal weight is tampered
with, all weights in its future become suboptimal in the sense considered above and should, in principle,
be reevaluated. As they acknowledge, however, this is not practically feasible in most applications.

One important detail must be made clear. We have noted in the discussion surrounding (46) that weight
scaling will affect both the covariance matrix $C(n-1)$ and the parameter $\kappa(n-1)$ in such a way that the
overall ellipsoid matrix is not affected. This means that the "expansion" we desire in the present context
will not take place if the scaling is carried out "properly." The remedy is to scale only the covariance
matrix prior to optimization. That is, the scaled matrix $C_s(n-1)$ is used in constructing polynomial
(44) or (45), but $\kappa(n-1)$ is not scaled until after the data set is considered. It will be noted that a
formal problem arises with respect to our previous discussion, since the weights
$\lambda_n(1), \ldots, \lambda_n(n-1)$ are used in the scaled covariance matrix, while the weights
$\lambda_{n-1}(1), \ldots, \lambda_{n-1}(n-1)$ remain in $\kappa(n-1)$. This
nuance, however, is necessary to achieve the desired result. Since the theoretical developments underlying
the UOBE algorithm do not, strictly speaking, support identification of time-varying systems, the use of
UOBE for adaptive purposes is based on heuristic procedures of which this "improper" scaling is a part.
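
The effect of the "improper" scaling can be checked numerically. The following Python sketch assumes that forgetting multiplies the covariance matrix by $\alpha$, that "proper" scaling multiplies $\kappa$ by the same factor, and that the ellipsoid volume is proportional to $\det(C/\kappa)^{-1/2}$; it is restricted to $m = 2$ parameters for brevity, and the function names are hypothetical.

```python
import math


def scale_matrix(C, alpha):
    """Return alpha * C for a matrix stored as a list of rows."""
    return [[alpha * c for c in row] for row in C]


def volume_proxy(C, kappa):
    """Quantity proportional to the volume of {theta: theta^T (C/kappa) theta <= 1}.

    For m = 2, det(C/kappa)^(-1/2) = kappa / sqrt(det C).
    """
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    return kappa / math.sqrt(det)


C = [[4.0, 1.0], [1.0, 3.0]]
kappa = 2.0
alpha = 0.9

v_before = volume_proxy(C, kappa)
# "Proper" scaling: covariance and kappa scaled together, ellipsoid unchanged.
v_proper = volume_proxy(scale_matrix(C, alpha), alpha * kappa)
# "Improper" scaling: only the covariance is scaled, ellipsoid inflates.
v_improper = volume_proxy(scale_matrix(C, alpha), kappa)
inflation = v_improper / v_before
```

Under these assumptions the inflation factor is $\alpha^{-m/2}$, which is $1/\alpha$ for the two-parameter case used here.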

With the exception of the minor issue discussed above, exponential forgetting amounts to a UOBE
algorithm with non-unity scaling. Accordingly, it is somewhat inefficient because $O(0.5m^2 + km)$ multiplies
are required at each $n$ just to implement the forgetting factor (see Section 6.5.1). Further, it has not been
found to be effective for adaptation in simulations, unless the system dynamics are changing rather slowly
[87]. We shall discuss this effect in the simulations below. The reason is that the exponential decay of the
influence of past data sets is frequently not fast enough to discount very heavily weighted data, so that
the estimate does not respond to fast changes in the system dynamics. To counter this problem, a small
$\alpha$ might be proposed, but this has the effect of creating a very small effective "window" which, in turn,
leads to high variability and loss of spectral resolution. From the point of view of the ellipsoid, the
pre-scaling by $\alpha$ results in an inflation in the volume by a factor inversely proportional to a power of
$\alpha$. Therefore, a large $\alpha$ results in little change in the ellipsoid, while a small $\alpha$ causes severe
inflation of the ellipsoid and induces a series of "oscillations" in the ellipsoid size. Further, this cycle of
expanding and shrinking ellipsoids causes a tendency to accept more data sets. Therefore, from the SM point
of view, small values of $\alpha$ are least desirable. These phenomena will be illustrated in the simulation
studies below.

The adaptive UOBE algorithms to which we now turn do not depend on a fixed factor, such as $\alpha$,
to expand the ellipsoid volume. Instead, these algorithms expand the ellipsoid by (selectively) removing
previously accepted influential data sets from the system, either partially or completely. Relinquishing
their influence on the current ellipsoid allows it to expand and adapt to the changes in the signal
dynamics.

6.4.3  Forgetting  by  Back-Rotation 

The forms of adaptation to be discussed here do not fit as neatly into our previous formalisms as does
exponential forgetting. Let us begin with the general UOBE algorithm for which the scaling sequence
is $\zeta(\cdot)$. Having obtained an estimate $\theta(n-1)$ with associated covariance matrix $C(n-1)$, we wish to
consider the incoming data set $(y(n), x(n))$. Before doing so, however, and even prior to scaling, we adjust
the existing system of equations in order to "downweight" the influence of some, or all, of the previous
data sets. The means by which the existing data sets are modified is to, in effect, introduce different
minimization weights. In the present situation, we wish to change (in general, all) weights used at time
$n-1$, $\lambda_{n-1}(t)$, $t \in [1, n-1]$, to a new set, say $\lambda_{n-1,d}(t)$, $t \in [1, n-1]$. We assume that the new
set of weights is not obtained by simple scaling, but restrict ourselves to the case in which the new weights
will be of the form

$$\lambda_{n-1,d}(t) = \left(1 - \varphi_{n-1}(t)\right)\lambda_{n-1}(t), \qquad t \in [1, n-1] \qquad (53)$$

where $0 \le \varphi_{n-1}(t) \le 1$. In effect, we wish to remove a fraction equivalent to $\varphi_{n-1}(t)$ of the contribution
of the data set at time $t$ from the estimate. Not surprisingly, this can be accomplished by treating
$(y(t), x(t))$ as a new data set with "weight" $-\varphi_{n-1}(t)\lambda_{n-1}(t)$. In the MIL-WRLS context, no modifications
to the basic algorithm are required. In the context of QR-WRLS, where the square root of the weight is
taken (see discussion below (28)), this is achieved by using weight $\sqrt{\varphi_{n-1}(t)\lambda_{n-1}(t)}$ and introducing
some sign changes in the algorithm [26],[27],[88].
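
The remark that MIL-WRLS needs no modification to remove a data set can be illustrated as follows. This Python sketch assumes a textbook matrix-inversion-lemma WRLS update and the standard residual recursion for $\kappa$ (neither is transcribed from the equations of this paper), and simply feeds a previously used data set back in with weight $-\varphi\lambda$. The function names and numerical values are hypothetical.

```python
def mil_wrls_update(theta, P, kappa, x, y, gamma, lam):
    """Weighted RLS update by the matrix inversion lemma; lam may be negative."""
    m = len(theta)
    Px = [sum(P[i][j] * x[j] for j in range(m)) for i in range(m)]
    G = sum(x[i] * Px[i] for i in range(m))
    eps = y - sum(theta[i] * x[i] for i in range(m))
    d = 1.0 + lam * G
    theta = [theta[i] + lam * Px[i] * eps / d for i in range(m)]
    P = [[P[i][j] - lam * Px[i] * Px[j] / d for j in range(m)] for i in range(m)]
    kappa = kappa + lam * gamma - lam * eps * eps / d
    return theta, P, kappa


def back_rotate(theta, P, kappa, x, y, gamma, lam, phi=1.0):
    """Remove the fraction phi of a data set that was included with weight lam."""
    return mil_wrls_update(theta, P, kappa, x, y, gamma, -phi * lam)


# A data set is rotated in with weight 2.5, then removed completely or partially.
state0 = ([0.3, -0.1], [[2.0, 0.5], [0.5, 1.0]], 1.5)
x, y, gamma, lam = [1.0, -2.0], 0.7, 0.25, 2.5
state1 = mil_wrls_update(*state0, x, y, gamma, lam)
restored = back_rotate(*state1, x, y, gamma, lam)              # phi = 1
partial = back_rotate(*state1, x, y, gamma, lam, phi=0.4)      # phi = 0.4
direct = mil_wrls_update(*state0, x, y, gamma, (1 - 0.4) * lam)
```

Complete removal returns the original state, and removing the fraction $\varphi = 0.4$ gives the same state as including the data set once with weight $(1 - \varphi)\lambda$, in agreement with (53).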

The method by which a data set is completely removed from the previous system using QR
decomposition by Givens rotations has been called back rotation in the papers cited above. We will use this
term to refer to removal by both QR-WRLS and MIL-WRLS even though it loses its technical significance
for the latter. The technique to partially remove a prior data set is a simple generalization suggested in
[32],[87],[88]. Let us now formalize this procedure, focusing on QR-WRLS (similar developments can be
obtained for the MIL-WRLS version).

Suppose we, in principle, sequentially modify weights as described above, beginning at time $t = 1$. The
following (and similar) quantities will pertain to the "downdated" system of equations whose weights have
been modified to time $t$: $C_d(n-1;t)$, $T_d(n-1;t)$, $D_{1,d}(n-1;t)$, $\theta_d(n-1;t)$, $\kappa_d(n-1;t)$, where each is
similar to familiar quantities in the foregoing discussions. We also omit the second argument if it is $n-1$.
For example, $C_d(n-1) \stackrel{\text{def}}{=} C_d(n-1;n-1)$. Following the modification of the data set, the downdated
equation to be solved in the QR-WRLS method (if the solution were desired) is

$$T_d(n-1;t)\,\theta_d(n-1;t) = D_{1,d}(n-1;t). \qquad (54)$$

The downdated ellipsoid matrix is $C_d(n-1;t)/\kappa_d(n-1;t)$ where

$$C_d(n-1;t) = T_d^T(n-1;t)\,T_d(n-1;t) \qquad (55)$$

$$\kappa_d(n-1;t) = \delta_{1,d}(n-1;t) + \tilde{\kappa}_d(n-1;t), \qquad (56)$$

with $\delta_{1,d}(n-1;t) \stackrel{\text{def}}{=} \mathrm{tr}\{D_{1,d}^T(n-1;t)\,D_{1,d}(n-1;t)\}$ and

$$\tilde{\kappa}_d(n-1;t) = \tilde{\kappa}_d(n-1;t-1) - \varphi_{n-1}(t)\,\lambda_{n-1}(t)\,\gamma(t)\left(1 - \gamma^{-1}(t)\,\|y(t)\|^2\right). \qquad (57)$$

The quantity $\tilde{\kappa}_d(n-1;0) = \tilde{\kappa}(n-1)$ represents the updated value of $\tilde{\kappa}$ which includes $(y(n-1), x(n-1))$.
Equations (56) and (57) follow immediately from the definition of $\kappa$ found in (40) and a basic understanding
of the back rotation process being undertaken. Following all necessary downdating just prior to time $n$,
the algorithm uses the downdated system to compute the downdated and scaled quantities


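$$G_{s,d}(n) \stackrel{\text{def}}{=} x^T(n)\,C_d^{-1}(n-1)\,x(n)\,\zeta(n) = x^T(n)\,C_{s,d}^{-1}(n-1)\,x(n), \qquad (58)$$

$$H_{s,d}(n) \stackrel{\text{def}}{=} x^T(n)\,C_d^{-2}(n-1)\,x(n)\,\zeta^2(n) = x^T(n)\,C_{s,d}^{-2}(n-1)\,x(n) \quad \text{(necessary for trace only)}, \qquad (59)$$

and

$$\kappa_{s,d}(n-1) \stackrel{\text{def}}{=} \frac{\kappa_d(n-1)}{\zeta(n)}. \qquad (60)$$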

In turn, these numbers are used in place of their "non-downdated" counterparts in (44) or (45) to test for
the existence of, and to compute, the optimal weight for $(y(n), x(n))$. Once the optimal $\lambda^*(n)$ is found,
we define

$$\lambda_n(t) = \begin{cases} \lambda_{n-1,d}(t), & t = 1, 2, \ldots, n-1 \\ \lambda^*(n), & t = n \end{cases} \qquad (61)$$

for the next iteration.

The process described above would appear to be extraordinarily computationally expensive in general,
since each past weight is modified at each $n-1$. Recall, however, that "most" data sets are never
included in the estimate in the first place ($\lambda^*(n) = 0$) and therefore the system need not be downdated
at these times. If the data set at time $t$, for example, was not included in the estimate, then formally
$C_d(n-1;t+1) = C_d(n-1;t)$, $T_d(n-1;t+1) = T_d(n-1;t)$, etc., and no computation is required. A similar
situation obtains if a data set, say at time $t$, was completely removed by back-rotation so that
$\lambda_{n-1}(t) = 0$. In this case, no computational effort is required to downdate this data set at time $n-1$.
Further, in many cases the modification of a particular data set is not desired. If, for example, the data
set at $t$ is not to be altered, then $\varphi_{n-1}(t) = 0$, and no computation is necessary. Finally, note that when
the "new" data set at $n$ is rejected ($\lambda^*(n) = 0$), then $T(n) = T_d(n-1)$ and $\theta(n) = \theta_d(n-1)$, and, once
again, no computation is actually required.

A wide range of adaptation strategies is inherent in the general formulation described above, many of
them computationally inexpensive. Three cases are considered:

1. $l$ is a constant window length and, for all $n$,

$$\varphi_{n-1}(t) = \begin{cases} 1, & t = n - l \\ 0, & \text{other } t \end{cases} \qquad (62)$$

2. $l$ is a constant window length and, for all $n$, $1 - \varphi_{n-1}(t)$ is zero prior to time $n - l + 1$ and smoothly
(perhaps linearly) tapers to unity at time $n$.

3. $\mathcal{T}_{n-1}$ is some past set of equations to be "forgotten," and,

$$\varphi_{n-1}(t) = \begin{cases} 1, & t \in \mathcal{T}_{n-1} \\ 0, & t \notin \mathcal{T}_{n-1} \end{cases} \qquad (63)$$


The first case above corresponds to the use of a sliding window of length $l$, outside of which all
previous data sets are completely removed. Norton and Mo have called this case fixed memory bounding
[86] while Deller and Odeh have called it simply windowing and have suggested an efficient algorithm for
implementing it [32],[87],[88]. The estimate at time $n$ covers the range $[n - l + 1, n]$. The windowing
technique is made possible by the ability to completely and systematically remove data sets at the trailing
edge of the window. Only one back-rotation is required prior to optimizing at time $n$, and this is only
necessary if $\lambda_{n-1}(n-l) = \lambda^*(n-l) \neq 0$.
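
The bookkeeping implied by the last sentence is small. The Python sketch below assumes that the weights of accepted data sets are stored in a dictionary keyed by time (a hypothetical data structure, not one described in the cited papers) and returns the single negative-weight update, if any, that must be applied before optimizing at time $n$.

```python
def trailing_edge_downdate(weights, n, l):
    """Return the list of (time, negative weight) removals needed at time n.

    weights: dict mapping time t to the weight with which the data set at t
             was accepted; rejected data sets are simply absent.
    l:       window length, so the estimate at time n covers [n - l + 1, n].
    """
    t = n - l                       # data set that has just left the window
    lam = weights.get(t, 0.0)
    return [(t, -lam)] if lam != 0.0 else []


accepted = {3: 0.7, 8: 1.2}         # hypothetical accepted data sets
removals_at_8 = trailing_edge_downdate(accepted, 8, 5)    # data set 3 leaves
removals_at_9 = trailing_edge_downdate(accepted, 9, 5)    # data set 4 was never accepted
```

Because rejected data sets never appear in the dictionary, the cost of windowing scales with the number of accepted points rather than with the window length.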

Case 2 represents another windowed, or finite memory, approach, but in this case the window is
permitted to taper smoothly to zero as it moves into the past. For example, the effective weights might
decrease linearly when moving toward the trailing edge of the window. Hence, the data set at the trailing
edge has an effective weight of $l^{-1}\lambda^*(n+1-l)$ and the data set to be rotated in has an effective weight
of $\lambda^*(n)$. To reiterate an important point made in the general discussion above, although each data set
must be partially rotated out $l$ times, only those data sets that were previously accepted (in the past $l$
recursions) need to be considered by the algorithm. Let us refer to this method as tapered forgetting.

We remark that a tapered window can smooth the estimate, but at the expense of a significant amount
of computation. Each accepted point must be back rotated about $l$ times, where it can be true that $l \gg m$.
Depending on the circumstances, the smoother window may not warrant the extra computation required to
implement it (see simulations below and [87] for further discussion).

Case 3 is a different type of strategy which Deller and Odeh call selective forgetting. This technique
chooses the data sets to be removed from the system based on certain user defined criteria. The selection
process can be, for example, to remove (or downweight) only the previously heavily weighted data sets, to
remove the data sets that were accepted in regions of abrupt change in the signal dynamics, or to remove
the data sets starting from the first data set and proceeding sequentially. Whatever the criteria, a
fundamental issue is to detect when adaptation is needed to improve the parameter estimates. This issue is
further investigated in the simulation studies below.
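
The three choices of removal fraction described in this section can be summarized in a short Python sketch. The linear taper is only one of the "smooth" tapers permitted by Case 2, the forget set of Case 3 is left to the caller, and all function names are hypothetical.

```python
def phi_window(t, n, l):
    """Case 1 (windowing): remove completely the data set at t = n - l."""
    return 1.0 if t == n - l else 0.0


def retained_taper(t, n, l):
    """Case 2 (tapered forgetting): the retained fraction 1 - phi.

    Zero prior to time n - l + 1, rising linearly to unity at time n
    (a linear taper is assumed here).
    """
    if t < n - l + 1:
        return 0.0
    return min(1.0, (t - (n - l)) / l)


def phi_selective(t, forget_set):
    """Case 3 (selective forgetting): remove the data sets in forget_set."""
    return 1.0 if t in forget_set else 0.0
```

With the linear taper, the data set at the trailing edge $t = n - l + 1$ retains the fraction $1/l$ of its weight, matching the effective weight quoted for tapered forgetting above.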

6.4.4  Illustration 

Simulation results for several variations on the general adaptive SM-WRLS algorithm are now presented.
We continue with the example initiated in Section 6.3.2. The reader is reminded that only the results for
$a_{1*}$ are shown.

The first experiment concerns the use of exponential forgetting. We noted above that this form of
adaptation will often fail to track quickly varying parameters. This was the case with both the "slow"
and "fast" systems used here for any reasonable forgetting factor. The problem is the inability to "forget"
heavily weighted data quickly enough. Accordingly, we tried the experimental procedure of replacing any
optimal weight by unity before incorporating the chosen data set. The results for forgetting factor
$\alpha = 0.99$ are shown in Fig. 9. Whereas the "weight override" might be expected to cause a vastly increased
fraction of the data to be used, in fact only fractions $p = 0.073$ and $p = 0.094$ of the data were used for the
fast and slow systems, respectively. Clearly, the tracking is very good for the slow system, and perhaps
acceptable for some purposes for the fast system. $\alpha = 0.99$ was the smallest forgetting factor which would
give "acceptable" tracking in the sense of reaching the "target" values during each cycle in the fast
case. This large forgetting factor is responsible for the variability seen in the regions which are easier to
track. The estimate can, of course, be smoothed by choice of a smaller $\alpha$. In the slow case, the estimate
can be smoothed considerably before time resolution is lost.

Figure 10 shows the volume traces, and Fig. 11 $\mathrm{sgn}\{\kappa(n)\}$, as functions of $n$. As in the "nonadaptive"
experiments, we observe a tendency for $\kappa$ to become negative when the parameters change abruptly. The
problem is not as "serious" as it was with the nonadaptive cases because the algorithm tends to "recover."
That is, when $\kappa$ goes negative, the volume goes into a trend of expansion (due to forgetting) ultimately
leading to a positive $\kappa$. We can imagine that the ellipsoid ultimately becomes large enough to "recapture"
the moving parameters, and bring the identification back into line with the underlying principles. However,
a positive $\kappa$ is a necessary but not sufficient condition for this to be true (for the ellipsoid to contain the
true parameters), so we must be careful with this analysis. In the use of specifically adaptive algorithms,
generally we observe that the identification does enter phases in which it operates outside the principles of
SM identification, but that it tends to recover and operate properly due to the adaptation measures.

Figure 12 shows the simulation result of the windowed SM-WRLS algorithm using a window of length
250. This strategy uses only the fraction $p = 0.094$ of the data for the fast system, and $p = 0.10$ for the
slow system. Since most of the data sets rotated into the system are eventually rotated out, this strategy
effectively uses about twice the number of data sets rotated in ($\approx 2p$). More data are used than with the
unmodified SM-WRLS algorithm, but more accurate estimates result and the time varying parameters are
tracked more quickly and accurately. For the slow system, we observe that the algorithm behaves properly
(in the sense that $\kappa$ remains positive) for nearly the entire range. For the fast system, there are relatively
short recovery phases ($\approx 100$ points) required after each abrupt change in dynamics. The volume traces
behave as expected with trends toward increase (due to "forgetting") interrupted by occasional decreases
as data sets are accepted. An example for the slow case is shown in Fig. 13.

As expected, the estimates are smoothed, time resolution lessened, and the fraction of accepted points
decreases, as window lengths increase. The parameter estimates for the fast system and window length
$l = 500$ are shown in Fig. 14 as an example. The fractions of data accepted are $p = 0.070$ and $p = 0.060$
for the fast and slow systems, respectively. Also not unexpectedly, recovery periods, which were virtually
nonexistent for the slow system with $l = 250$, are now present with $l = 500$, though still for a small fraction
of the time. The recovery phases for the fast case increase in duration so that they now occupy more
than one-third of the range.

As the window length continues to increase, the effects reported above continue to change in the
expected ways. In particular, it is not unexpected that at some point, the process would begin to disintegrate
from a theoretical point of view, since as $l \to \infty$, the "windowed" algorithm approaches "nonadaptive"
SM-WRLS. In fact, the recovery phases for the fast system continue to increase until, at $l = 1000$, the
parameter $\kappa$ is negative for nearly the entire range following the initial change. The process erodes and
fails to track properly after the first one and one-half cycles. Interestingly, only the fraction $p = 0.030$
of the data are used in this estimate, and most of these are taken in the initial cycles. The parameter
estimate and $\kappa$ are shown in Figs. 15(a) and 15(b) for this case.

As an aside, we observe that this and similar UOBE algorithms are frequently capable of tracking while
using small fractions of the data, even "in violation of theory" (κ < 0). However, empirically, the estimate
frequently diverges after a problem-dependent interval. This suggests the possibility of monitoring κ and
"resetting" the algorithm after a sustained period of "violation." (The selective forgetting approach to
be described can be interpreted as a highly conservative version of this procedure.) Such a procedure
would be quite unpredictable unless a theoretical analysis of the process under conditions of negative
κ is forthcoming. In fact, this "unpredictability" is the same problem encountered when no adaptation
measures are taken, but with some alleviation or postponement of the undesirable performance.

The selective forgetting strategy selects the data sets to be (partially or completely) removed from the
estimate according to certain criteria in order to remove their influence on the result. In keeping with
the foregoing developments, the selection procedure used here is to monitor the parameter κ for positivity.
When it is found that κ(n) < 0 for some n, we simply back-rotate previously accepted data sets, beginning
with the oldest data set remaining in the estimate, until this number becomes positive. We reiterate
that κ(n) merely being positive does not ensure that the true parameters have returned to the interior
of the bounding ellipsoid, so the procedure is purely experimental. This technique yields the simulation
results shown in Figs. 16(a) and 16(b) for the fast and slow systems, respectively. The identification of
the fast system uses only p = 0.050 of the data, 84.0% of which are back-rotated for adaptation, so that
b = 0.042. For the slow system, the rates are p = 0.064 and b = 0.047. While usually requiring even
less computational effort, this method is seen to provide superior estimates to those obtained from the
other adaptive techniques. For the fast system, it is noted that large errors occur in the estimates at the
points of discontinuity in the true parameters. At some computational expense, this could potentially
be resolved by, after forgetting, removing the data set (y(n), x(n)) which caused κ(n) to go negative and
recomputing the weight (or some similar heuristic).
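
The control flow of the selective forgetting rule can be sketched as follows. This is a minimal illustration, not the implementation used for the simulations: the names `selective_forget`, `kappa`, and `back_rotate` are hypothetical, and the toy state in the usage lines merely stands in for the actual ellipsoid quantities.

```python
from collections import deque

def selective_forget(accepted, kappa, back_rotate):
    """Back-rotate the oldest accepted data sets until kappa() is positive.

    accepted    -- deque of previously accepted data sets, oldest first
    kappa       -- hypothetical callable returning the current kappa(n)
    back_rotate -- hypothetical callable removing one data set from the estimate
    Returns the list of data sets that were removed.
    """
    removed = []
    if kappa() >= 0:                      # no "violation": nothing to forget
        return removed
    while accepted and kappa() <= 0:      # stop as soon as kappa is positive
        oldest = accepted.popleft()
        back_rotate(oldest)
        removed.append(oldest)
    return removed

# Toy stand-in: kappa is a running sum of per-data-set contributions.
state = {"kappa": -1.0}
contrib = {"d1": -3.0, "d2": 1.0, "d3": 1.0}
accepted = deque(["d1", "d2", "d3"])

def kappa():
    return state["kappa"]

def back_rotate(item):
    state["kappa"] -= contrib[item]       # remove this data set's influence

removed = selective_forget(accepted, kappa, back_rotate)
```

In this toy run only the oldest data set is removed, after which the stand-in κ is positive and the loop stops. The average number of removals per n is what the text denotes by b.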

6.5 Implementing the UOBE Algorithm in O(m) Time
6.5.1  Complexity  of  the  Basic  UOBE 

From a signal processing point of view, one of the most interesting aspects of a UOBE algorithm is its
inherent ability to select only data points which are informative in the sense of refining the feasible set.
The fact that typically 70% - 95% of the data are rejected by this criterion would seem to imply a remarkable
savings in computation. We have noted in Section 6.2, however, that this is only true to the extent that
the SM preprocessing of the incoming data set is negligibly expensive compared with the inclusion of it in
the estimate. In this section, we examine some factors related to this complexity issue.

A comparison of the computational loads of the various algorithms discussed in this paper is shown in
Table 1. A complex floating point operation (cflop) is taken to be approximately one complex multiplication
plus one complex addition operation. Additions which are unpaired with multiplications are ignored.
The numbers shown arise from efficient procedures which avoid recomputation of quantities, for example
ε(n, Θ(n − 1)), which are shared among different operations. Only numbers dependent upon m and k are
shown, with constant (usually small) numbers of cflops ignored. Not shown in the table are tallies to update
κ(n), and the number of operations needed to compute an optimal weight when the data set is accepted.
κ(n) requires about 4 cflops in the MIL-WRLS cases, and (m + 1)k in the QR-WRLS cases. Optimal
weights require a small number of cflops (about 15) which may be thought of as nearly independent of
m and k since all quantities computed are used for other purposes. Figures shown are based on volume
optimization, but the trace tallies are nearly identical.

The following analyses are applicable to the usual case in which the number of outputs from the system,
k, is small relative to the number of parameters estimated, m. In fact, for simplicity, let us set k = 1 (MISO
system). The general conclusions reached, however, are valid when k << m, and we shall continue to show
y(·) and ε(·, Θ(·)) as vectors.

As a standard of comparison, we note that conventional MIL-WRLS requires O(3m²) cflops per n, with
an additional O(m²/2) required to include a scaling sequence ζ(·). For QR-WRLS, O(2.5m²) cflops per n
are required, with an additional O(m²/2) needed for scaling.

From Table 1, we may state that the average operation count for an adaptive UOBE algorithm implemented
on a sequential machine is approximated by

$$f_{opt} \approx O(c_1 m^2) + s\,O(m^2/2) + b\,O(c_2 m^2) + p\,O(c_3 m^2) \ \text{cflops per } n \qquad (64)$$

where s is unity if the algorithm involves a scaling sequence and/or a forgetting factor and is zero otherwise;
p is the average number of data sets accepted per n; b is the average number of back-rotations performed
per n; and c_1, c_2 and c_3 are small numbers (all in the range 0.5 - 2.5) which depend upon whether MIL-WRLS
or QR-WRLS is used. The first term is due to the procedure which checks for information in the
incoming data. The others are attributable to scaling, adaptation, and solution update, respectively. The
subscript "opt" is used to indicate that the proper optimization described above is used. Apparently, the
UOBE algorithm, as presently formulated, is an O(m²) process. The objective of the section below is to
demonstrate a method for reducing the effective complexity to O(m) by reducing the checking cost, thereby
making a UOBE algorithm a desirable alternative to standard RLS-based methods from a computational
point of view. We also mention a parallel processing approach which likewise achieves the O(m) goal.
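
To get a feel for the relative size of the terms in (64), the count can be evaluated numerically. The sketch below is only illustrative: the function name is hypothetical, the constants c1, c2, c3 are placeholder values picked from the stated range 0.5 - 2.5, and the rates p = 0.050 and b = 0.042 are borrowed from the selective forgetting experiment on the fast system.

```python
def uobe_cflops_per_n(m, p, b, s, c1=1.0, c2=2.5, c3=2.5):
    """Evaluate the four terms of (64) for given m, p, b, s.

    c1, c2, c3 are placeholders chosen from the stated range 0.5 - 2.5;
    the actual values depend on whether MIL-WRLS or QR-WRLS is used.
    """
    terms = {
        "checking": c1 * m**2,         # test for information in the data
        "scaling": s * m**2 / 2,       # scaling sequence / forgetting factor
        "adaptation": b * c2 * m**2,   # back-rotations
        "update": p * c3 * m**2,       # solution update for accepted data
    }
    terms["total"] = sum(terms.values())
    return terms

counts = uobe_cflops_per_n(m=10, p=0.05, b=0.042, s=0)
```

With these assumed values the checking term (100 cflops per n) is several times larger than the adaptation and update terms combined (about 23 cflops per n), which is why the checking cost is the natural target for reduction.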

Before detailing the methods, some points about the use of the approximation "O(m)" are necessary.
The first concerns a practical matter. The objective in the following is to reduce the computational
complexity of the algorithms to an average of O(m) flops per n. It will be appreciated that, without
data buffering, the data flow is still limited by the worst case O(m²) computation. However, if a buffer
is included, the algorithm can easily be structured to operate in O(m) average time per n. Further, by using
interrupt driven processing of the checking procedure, it may be possible to reduce the average time even
further. Other points concern algorithmic details. We see from (64) that the use of a unity scaling sequence
(SM-WRLS algorithm) is required in order to avoid an invariant O(m²/2) flops per n. We specifically
assume the use of this algorithm below, although the O(m) checking procedure to be developed does not
depend on this choice. Secondly, we note that even if the checking procedure can be made O(m), terms
bO(m²) and pO(m²) (typically b ≤ p) persist in (64). This means that to truly achieve O(m) complexity,
b and p must be O(1/m). For large m, this will not always be the case. In fact, some experimental
evidence suggests, not unexpectedly, that p increases, rather than decreases, with increasing m. For "large"
m (conservatively, say, m > 10), therefore, it is the case that the complexity is reduced to O(pm²) by
O(m) checking. It should be clear, however, that neither O(m) nor O(pm²) complexity can be achieved if
the checking procedure remains O(m²). We therefore pursue an O(m) test for information in an incoming
data set.

With UOBE, the number of computations needed for each n depends on whether the corresponding
data set is accepted for processing by the optimization criterion. UOBE essentially reverts to MIL-WRLS
or QR-WRLS when a data set is accepted. Since most of the time the data set is rejected, for significant
complexity gain a UOBE algorithm must require many fewer than O(3m²) flops for checking. We digress
momentarily, therefore, to view some of the details of the checking procedure.

In principle, the information checking procedure for the volume or trace algorithms consists of forming
either F_v(λ) or F_t(λ) of (44) and (45), then solving for a positive root. In either case, however, the
polynomial can have at most one positive root (see Proposition 2 in Appendix A.1). The test therefore
reduces to one of testing the zero order term for negativity. When the test is successful, the root
solving and updating proceeds, requiring the standard MIL- or QR-WRLS load, plus a few operations for
finding the optimal weight. The most expensive aspect of this information test is the computation of the
quantity G_d(n) in the case of volume minimization, or H_d(n) for trace minimization. (For generality, we
assume downdating is used. If this is not the case, it is merely necessary to drop the subscripts "d" on all
quantities.) In the MIL-WRLS case, this requires O(m²) flops. In the QR-WRLS case, a problem arises
because G_d(n) depends upon the inverse normal matrix, C_d⁻¹, which is not otherwise used in the process.
Similarly, H_d(n) depends on C_d⁻². In the paper by Deller [27], the following method has been suggested
to sidestep this problem for G_d(n): Recalling the definition of G_d(n) and noting (27), we can write

$$G_d(n) = x^T(n)\,T_d^{-1}(n-1)\,T_d^{-T}(n-1)\,x(n) = g_d^T(n)\,g_d(n) = \| g_d(n) \|^2 . \qquad (65)$$

Since x(n) = T_d^T(n − 1) g_d(n), and the matrix T_d^T(n − 1) is lower triangular, g_d(n) is easily found from the
available quantities at time n by forward substitution. The procedure can be repeated to compute H_d(n)
if needed, since

$$H_d(n) = g_d^T(n)\,T_d^{-T}(n-1)\,T_d^{-1}(n-1)\,g_d(n) = \| h_d(n) \|^2 . \qquad (66)$$

The total computational load for this method is O(m²/2) for G_d(n) and O(m²) for H_d(n), which is far less
than the effort required to invert C_d(n − 1). When MIL-WRLS is used for the covariance and parameter
update in UOBE, the checking ("precomputation" of G_d(n)) removes O(m²) flops from the update load,
but the checking does not contribute to the update for QR-WRLS.
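
The substitution steps can be sketched in a few lines. This is a minimal real-valued illustration (for complex data the transposes would become conjugate transposes); the function names and the 2 × 2 example are hypothetical, with T taken as upper triangular and the normal matrix as C = TᵀT.

```python
def forward_substitution(L, x):
    """Solve L g = x for a lower triangular matrix L (list of rows)."""
    g = []
    for i in range(len(x)):
        acc = x[i] - sum(L[i][j] * g[j] for j in range(i))
        g.append(acc / L[i][i])
    return g

def back_substitution(U, g):
    """Solve U h = g for an upper triangular matrix U (list of rows)."""
    m = len(g)
    h = [0.0] * m
    for i in reversed(range(m)):
        acc = g[i] - sum(U[i][j] * h[j] for j in range(i + 1, m))
        h[i] = acc / U[i][i]
    return h

def G_and_H(T, x):
    """Return (x^T C^-1 x, x^T C^-2 x) with C = T^T T, T upper triangular."""
    m = len(x)
    T_transpose = [[T[j][i] for j in range(m)] for i in range(m)]  # lower triangular
    g = forward_substitution(T_transpose, x)   # T^T g = x
    h = back_substitution(T, g)                # T h = g
    return sum(v * v for v in g), sum(v * v for v in h)

T = [[2.0, 1.0],
     [0.0, 1.0]]
x = [2.0, 3.0]
G, H = G_and_H(T, x)   # here C = [[4, 2], [2, 2]] and C^-1 x = [-0.5, 2]
```

Neither quantity requires forming or inverting C explicitly. Each triangular solve costs on the order of m²/2 multiply-add pairs, in line with the loads quoted above.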

6.5.2  Suboptimal  Tests  for  Innovation  in  the  Data 

In spite of the simplifications suggested above, the computation of the quantities G_d(n) and H_d(n) remains of
O(m²) complexity. Clearly, the trick is to try to avoid the computation of these numbers in the information
checking procedure. Deller and Odeh [32] have proposed a simple suboptimal updating rule: Include the
data set at time n only if

$$\| \varepsilon(n, \Theta(n-1)) \|^2 > \gamma(n) . \qquad (67)$$

This rule is used for both the volume and trace minimization versions of the UOBE and is not affected by
inclusion of a causal scaling sequence, ζ(·). The rationale for this test is simple. The zero order coefficients
a_0 and b_0 of (44) and (45), respectively, will never be positive if the test is met. In the volume case, for
example, the suboptimal check tests whether a_0 is negative if the term −κ_d(n − 1)G_d(n) is neglected. This
ignored term is always negative and becomes small as n increases if no forgetting is used. For a given set
of preceding optimal weights, λ*(1), ..., λ*(n − 1), the suboptimal test will never accept a data set
which would have been rejected by the optimal test. A similar analysis applies to the coefficient b_0 of the
trace algorithm.
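
The relation between the suboptimal and optimal checks can be illustrated with a small sketch for the real, single-output volume case. It rests on an assumption: the zero order coefficient is taken to have the form a_0 = m[γ(n) − ε²] − κ(n − 1)G(n), with γ(n) the error bound, which matches the description of the neglected term above but is not restated in this section. The function names and numerical values are hypothetical.

```python
def prediction_error_sq(y, x, theta):
    """Squared prediction error eps^2(n, theta(n-1)); costs O(m) operations."""
    eps = y - sum(t * xi for t, xi in zip(theta, x))
    return eps * eps

def suboptimal_accept(eps_sq, gamma):
    """Suboptimal check: accept only if the squared error exceeds the bound."""
    return eps_sq > gamma

def optimal_accept(eps_sq, gamma, kappa_prev, G, m):
    """Volume check: accept if the (assumed) zero order coefficient a0 < 0."""
    a0 = m * (gamma - eps_sq) - kappa_prev * G
    return a0 < 0

theta_prev = [1.0, -0.5]           # hypothetical previous estimate
x = [2.0, 1.0]                     # hypothetical regressor
gamma, kappa_prev, G, m = 1.0, 0.5, 5.0, 2

large = prediction_error_sq(3.0, x, theta_prev)   # prediction 1.5, error 1.5
small = prediction_error_sq(2.0, x, theta_prev)   # prediction 1.5, error 0.5
```

With these numbers the large-error data set passes both checks, while the small-error one is accepted only by the optimal check, so the suboptimal rule is the more conservative of the two. Its cost is a single inner product, where the optimal check also needs G.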

A deeper analysis of this suboptimal test has been made for the volume algorithm by Deller and Odeh
[32]. Let us denote the estimation error matrix at time n by

$$\tilde{\Theta}(n) = \Theta_* - \Theta(n) . \qquad (68)$$

The following inequality results immediately from (37) - (40):

$$\tilde{\Theta}^T(n)\,C(n)\,\tilde{\Theta}(n) < \kappa(n) . \qquad (69)$$

While it is tempting to view κ(n) as a bound on Θ̃(n) (see discussion of the D-H OBE algorithm below), it
is important to note that each side of this inequality is dependent upon λ_n(n). In fact, let us temporarily
write the two key quantities as functions of λ_n(n), C(n, λ_n(n)) and κ(n, λ_n(n)), and consider the usual
volume quantity to be minimized at time n,

$$\mu_v(n) = \det\left[ \kappa(n, \lambda_n(n))\, C^{-1}(n, \lambda_n(n)) \right] . \qquad (70)$$

It is assumed that enough data sets have been included in the covariance matrix at time n − 1 so that its
elements are large with respect to the incoming data.¹⁴ If a causal scaling sequence is included, the quantity
det C(n, λ_n(n)) is readily shown to be monotonically increasing with respect to λ_n(n) on the interval [0, ∞)
[87], with C(n, 0) = C(n − 1, λ_{n−1}(n − 1)). Under the assumption above, det C(n, λ_n(n)) will not increase
significantly over reasonably small values of λ_n(n). The attempt to maximize det C(n, λ_n(n)) in (70)
causes a tendency to increase λ_n(n) in the usual optimization process. However, the attempt to minimize
κ(n, λ_n(n)) generally causes a tendency toward small values of λ_n(n), unless a minimum of κ(n, λ_n(n))
occurs at a "large" value of λ_n(n). To pursue this idea and further points of the argument, we use two
key facts about κ(n, λ_n(n)) which are given in Proposition A in Appendix A.1. These are that κ(n, λ_n(n))
is either monotonically increasing on positive λ's or it has a single minimum. A necessary and sufficient
test for that minimum is (67). Accordingly, it can be argued that: If det C(n, λ_n(n)) is increasing, but
not changing significantly over reasonably small values of λ_n(n), then it is sufficient to seek the λ*_n(n) which
minimizes κ(n, λ_n(n)). If κ(n, λ_n(n)) is monotonically increasing on λ_n(n) > 0, this value is λ_n(n) = 0,
which corresponds to rejection of (y(n), x(n)). It suffices, therefore, to have a test for a minimum of
κ(n, λ_n(n)) on positive λ_n(n). As noted above, a simple test is embodied in condition (67). If this test is
met, it is then cost effective to proceed with the standard optimization centered on (44). Otherwise, the
explicit construction and solution of (44) can be avoided.

¹⁴ The validity of this assumption depends to some extent on the choice of scaling sequence ζ(·) if one is included.

In fact, this suboptimal test for innovation is similar to that used in the D-H OBE algorithm reported
in [24] and discussed in Section 6.2.2. The test used in D-H OBE is to accept the incoming data set only
if¹⁵

$$\varepsilon^2(n, \Theta(n-1)) > \gamma(n) - \kappa(n-1) . \qquad (71)$$

This inequality likewise tests for a minimum of κ(n) with respect to λ_n(n), and differs in form from (67)
because of the noncausal scaling factors (see (22) and surrounding discussion). There has been some
controversy in the literature as to the meaning of this test. Dasgupta and Huang argue simply that κ(n)
is "a bound on the estimation error," and should be minimized. Indeed, the minimization of κ(n) is the
optimization criterion used in D-H OBE, and no apparent connection to a set measure on the underlying
ellipsoid is made. Dasgupta and Huang's claim has been disputed by Norton and Mo [86] and is not clearly
supported by the heuristic arguments above, because the relative independence of C(n) and λ_n(n) is not
tenably argued. Nevertheless, examination of the analytical arguments above does reveal some interesting
similarities between the use of the D-H test and the suboptimal test (67). These revelations nearly (but
not quite) provide justification for the D-H test.

¹⁵ A scalar error is shown since this algorithm is developed for a SISO model.

While it is not exploited in the D-H OBE algorithm for reasons discussed in Section 6.2.2, the D-H
hyperellipsoid nevertheless does have a volume at each n. Liu et al. [68] have recently shown¹⁶ that there
is a quadratic equation in λ similar to (44) which must be solved to find the optimal "volume" root (see
Corollary 2 in Appendix A.1). For the sake of discussion, let us call the weight which optimizes volume
λ^v and that which optimizes κ(n), λ^κ. The volume quadratic has the amazing property that its zero order
coefficient is identical to that in (44), a_0, and may be checked for negativity as a necessary and sufficient
test for the existence of an optimal weight λ^v. Similarly to the suboptimal SM-WRLS strategy, the D-H
test (71) comes quite close to being a sufficient test for negativity of a_0, and therefore a sufficient test for
whether the volume can be diminished. Further, even if the D-H weight λ^κ is not equal to λ^v (and it likely
will not be), it can still be shown to shrink the volume [68] as long as λ^v > 0 exists. Consequently, if (71)
were exactly a test for a_0 < 0, then it would follow that the D-H algorithm, by reducing κ, simultaneously
reduces volume. The fact that λ^κ does not optimally minimize volume is apparently a small price to pay
for the ability to prove convergence. Regrettably, the test (71) is not quite sufficient to assure a_0 < 0.
Additionally, it must be true that G(n) > mk. This condition is most likely to be met for small n,
precisely when the most data sets are likely to be accepted. However, there is no assurance in general that
this condition will prevail. Consequently the D-H OBE test, while part of a very different approach, comes
intriguingly close to being justified by the same means as the suboptimal test associated with SM-WRLS,
but falls somewhat short. Some further theoretical work may ultimately resolve the apparent problem.

¹⁶ Somewhat unexpectedly, perhaps, because of the noncausal scale factors and the nonlinearities in λ_n(n) which would be
expected to arise.

Interestingly, the Deller-Odeh test (67) could be used as a "suboptimal" criterion for accepting data in
the D-H OBE algorithm. That is, (71) is satisfied whenever (67) is met. The benefit of this suboptimal
approach would be that it would assure that volume would be decreasing at each step by minimizing κ,
thereby providing a clear pointwise justification for the D-H approach. Before such an approach were
adopted, it would be necessary to ascertain that the convergence result (which is the raison d'être for the
D-H OBE algorithm) is preserved.
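
The gap between the D-H test and a sufficient test can be made explicit with a short calculation. It is a sketch under stated assumptions: the zero order coefficient of (44) is taken to have the form shown in the first line for the SISO case (k = 1), which matches the description of the neglected term in Section 6.5.2 but is not restated here; γ(n) denotes the error bound; and the two tests are read as ε² > γ(n) − κ(n − 1) for (71) and ε² > γ(n) for (67).

```latex
% Assumed form of the zero order coefficient (SISO case, k = 1):
a_0 = m\left[\gamma(n) - \varepsilon^2(n,\Theta(n-1))\right] - \kappa(n-1)\,G(n)

% D-H test (71):  \gamma(n) - \varepsilon^2 < \kappa(n-1), so
a_0 < m\,\kappa(n-1) - \kappa(n-1)\,G(n) = \kappa(n-1)\left[m - G(n)\right]
% which guarantees a_0 < 0 only if G(n) \ge m (with \kappa(n-1) > 0).

% Deller-Odeh test (67):  \gamma(n) - \varepsilon^2 < 0, so
a_0 < -\kappa(n-1)\,G(n) \le 0
% with no condition on G(n).
```

Under these assumptions the extra requirement on G(n) appears only for the D-H test, and any data set passing (67) also passes (71) because κ(n − 1) is nonnegative.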

6.5.3  Computational  Complexity  of  UOBE  Algorithms  with  Suboptimal  Checking 

Recall that the purpose of this pursuit is to find a way to avoid the O(m²) flops necessary to carry out the
checking process in the optimal algorithm. The test (67) requires only O(m) cflops, so that the revised
operation count is¹⁷

$$f_{subopt} \approx O(m) + s\,O(m^2/2) + b'\,O(c_2 m^2) + p'\,O(c_3 m^2) \ \text{cflops per } n \qquad (72)$$

where b' is the average number of back-rotations per n under the suboptimal checking policy; p' represents
the fraction of the data sets which are included in the update; and s indicates whether scaling is used. As
we have stressed above, even as p' → 0 the suboptimal algorithm remains of O(m²/2) complexity per n on
a sequential machine, unless SM-WRLS (ζ(n) = 1 ∀n) is used. Herein lies one of the most compelling
reasons for the choice of the simplest form of UOBE algorithm in signal processing.

¹⁷ Note that G_d(n) is no longer computed in the checking phase so that the operation count for MIL-WRLS is the full
O(3m²).

In light of (72), let us briefly consider the computational loads imposed by the specific adaptation
strategies described above. In each case, we assume QR-WRLS underlies the process, but the discussion
for MIL-WRLS is similar.

Of the adaptation methods described above, exponential forgetting is the most expensive computationally,
unless the UOBE algorithm already employs a non-unity scaling sequence. If the algorithm does not
employ a scaling sequence (s = 0 in (64) and (72)), then the inclusion of the forgetting factor essentially
imposes one (s = 1) and adds O(m²/2) cflops per n. If the algorithm does contain a scaling sequence,
then the forgetting factor can be combined with it prior to scaling, requiring only one cflop per n.

Since back-rotation is essentially equivalent to a covariance (or T(n)) update¹⁸ for an incoming data
set, each of these rotations takes O(2.5m²) cflops. If b back-rotations are performed on the average at
each n, then effectively O(2.5bm²) additional operations are required by the adaptation procedure. Since
p is usually small, whether a particular adaptation strategy is cost-effective depends on the number b. For
simple windowing, for example, b ≈ p and the adaptation adds negligibly to the computational load. For
tapered forgetting, on the other hand, b ≈ pl, where l is the effective window length, which may be quite
large. In this case, the adaptation might be the dominant cost requirement, completely overshadowing
any savings gained by suboptimal testing, for example. A high computational cost, therefore, might be
incurred for the benefits of a tapered window in the analysis. Finally, the cost of the selective forgetting
routine depends entirely upon the criterion employed for deciding to back-rotate a previous data set, which,
in turn, determines the value of b. An example will be discussed below in the simulation studies.
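
As a rough worked example of these rates, take the illustrative values m = 10, p = 0.05 and l = 500 (assumed here for the arithmetic only, though of the order seen in the simulations above) together with the O(2.5bm²) back-rotation cost:

```latex
\text{simple windowing: } b \approx p = 0.05
  \;\Rightarrow\; 2.5\,b\,m^2 = 2.5\,(0.05)(100) = 12.5 \ \text{cflops per } n

\text{tapered forgetting: } b \approx p\,l = (0.05)(500) = 25
  \;\Rightarrow\; 2.5\,b\,m^2 = 2.5\,(25)(100) = 6250 \ \text{cflops per } n
```

In this illustration the tapered window costs a factor of l = 500 more than simple windowing in adaptation alone, far exceeding the roughly m² = 100 cflops that an O(m) check can save per n.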

6.5.4  Illustration 

To illustrate the efficient methods based on suboptimal checking, we continue the study of the systems
described in Section 6.3.2. Not unexpectedly, the suboptimal "nonadaptive" SM-WRLS algorithm fails to
track either system properly. The result is similar to Fig. 7.

As illustrations of adaptive methods, we repeat the exponential forgetting and selective forgetting
experiments performed above. We use the same techniques and conditions except that the suboptimal
test is employed. Figure 17 shows the parameter estimates, and Fig. 18 the sgn{κ(·)} traces, resulting
from exponential forgetting. Comparing Figs. 9 and 17 we see that in the case of the fast system, the
parameters resulting from suboptimal testing track the true parameters more accurately for the first two
cycles, but then show signs of "breakdown" in the third cycle, unlike the parameters resulting with optimal
checking. No definitive conclusions can be drawn from this comparison of fast system results. In particular,
it should not be concluded that suboptimal testing will lead to faster disintegration of tracking. Indeed,
by comparing Figs. 11 and 18, we see that there is a much greater tendency for κ to remain positive in the
suboptimal case. In fact, there is evidence in the κ trace that the failure to track well in the third cycle
is a transient effect from which the identification may recover. In the slow system case, somewhat more
variance is seen in the parameter estimate with respect to the optimal checking case, but this is apparently
related to the many fewer data selected. Note the remarkable improvement in the κ behavior with respect
to the optimal case, indicating that the identification is more likely operating within the principles of SM
theory for suboptimal checking. Another important observation is that the number of data sets selected by
suboptimal checking is far fewer (roughly half) than that required by optimal checking. We currently have
no explanation for these preferable behaviors of the suboptimal checking case, but, importantly, they have
been quite generally observed across many simulations.

¹⁸ Note that a parameter solution update is not required, just the covariance update.

In a second experiment, selective forgetting is used in conjunction with the suboptimal testing. For this
case, the parameter results are practically indistinguishable from those obtained using optimal checking
(see Fig. 16). The notable difference is once again in the greatly reduced number of data used in the
suboptimal checking experiments. For the fast system p' = 0.022 and b' = 0.019, and for the slow system
p' = 0.041 and b' = 0.028. Again each of these fractions is roughly half the corresponding figure required
in the optimal checking cases.

In summary, we generally observe that the suboptimal technique uses about half as many data, but
produces comparable estimates to those obtained using the optimal procedure. This is true whether good
or bad tracking results. This means that not only does the suboptimal procedure reduce the complexity
of testing the data sets for innovation (the motivation for its development), but it also reduces (by about a
factor of two) the number of operations spent in rotating data sets into the system of equations. Further,
suboptimal checking frequently results in more "meaningful" identification in the sense that κ has a much
higher  tendency  to  remain  positive.  Other  examples  using  suboptimal  checking  are  found  in  [•■12].[.'^7].[8n]. 

6.6  Convergence  Issues  and  Colored  Noise 

In the following, we return to the case of time-invariant systems and discuss a few issues related to
convergence and colored inputs.

One of the interesting and practical benefits of having interpreted the UOBE algorithm as a WRLS
algorithm with a bounded error "overlay" is the immediate consequence for convergence of the estimator. It is
well-known that if the sequence ε_*(·) is wide-sense stationary, second moment ergodic almost surely (a.s.),
white noise (see discussion surrounding (14)), then the WRLS estimator Θ(·) will converge asymptotically
to Θ_*, a.s. (e.g. [4-5]). In the present case, we need only to add the qualifier that the UOBE algorithm not
cease to accept data in order to lay claim to this useful result.

Likewise, we may even assert a.s. convergence of the WRLS estimate, albeit to a bias, when ε_*(·) is
colored and persistently exciting¹⁹ (p.e.) [47]. Even in the presence of colored errors, therefore, as long
as the acceptance of data does not cease, and the "sampling" induced by data selection does not interfere
with the p.e., we may expect the UOBE estimate to converge.

It  would  be  interesting  to  have  a  precise  understanding  of  the  asymptotic  behavior  of  the  hyperellipsoidal 
feasible  set.  especially  in  the  case  of  colored  noise.  Knowledge  that  the  ellipsoid  is  vanishing  (white  noiae), 
or  becoming  as  small  as  possible  (colored  noise),  could  be  very  useful  information  indeed.  In  the  white 
noise  case,  a  sufficiently  small  ellipsoid  could  serve  as  a  reinforcing  indicator  of  convergence,  and  offer  a 
means  of  determining  error  bounds  on  the  estimate.  In  the  colored  case,  a  small  feasible  set  (known  to 
contain  the  true,  unbiased  estimate)  could  be  indispensible.  Unfortunately,  a  convergence  proof  for  most 
instances  of  the  UOBE  algorithm  is  not  forthcoming.  The  original  OBE  paper  by  Fogel  and  Huang  [41]  is 
sometimes  misunderstood  to  indicate  the  convergence  of  the  bounding  elbpsoid  to  a  point  under  ordinary 
conditions  on  £.(■).  In  fact,  the  paper  only  proves  this  convergence  for  the  case  of  unity  weights  so  that 
the  fundamental  optimization  process  is  not  taken  into  account.  .\o  known  proof  of  this  desirable  result 
for  the  F-H  OBE  algorithm,  or  for  any  instance  of  UOBE  with  causal  scaUng  exists,  whether  optimal  or 
suboptimal  checking  is  used.  However,  it  can  be  shown  for  the  volume  algorithm  with  causal  scaling  (see 
Corollary  4  in  .Appendix  .A.l)  that  if  an  optimal  weight  exists  at  time  n.  then  if  the  data  set  is  included 
using  this  weight,  then  the  volume  will  certainly  decrease; 


$$\mu_v(n) < \mu_v(n-1). \tag{74}$$


This indicates that the ellipsoid volume will converge to some unspecified size in some unspecified manner.
A similar result can be demonstrated for the trace algorithm [80].

In spite of this encouraging result, one of the drawbacks of the volume approach is that the set measure
$\mu_v$ is not a proper "metric" in the parameter space. By this we mean the following: Suppose we propose
the distance measure $d$ such that at time $n$, $d(\theta(n), \theta_*) = \mu_v(n)$. We immediately find that $d$ fails to be
a proper metric since $d(\theta(n), \theta_*) = 0$ does not imply that $\theta(n) = \theta_*$. This unfortunate situation arises
because the ellipsoid may potentially degenerate and reside in a subspace of the parameter space, thereby
achieving zero volume without being reduced to a point. According to Nayeri et al. [80], this will likely only
occur if p.e. is not achieved, and is therefore a more important problem with colored disturbances. This
potential anomaly provides motivation to consider the use of the trace measure for which a degenerate
ellipsoid will not produce a zero set measure.
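The difference between the two set measures on a degenerate ellipsoid can be illustrated numerically. The following is a minimal sketch, assuming the ellipsoid is summarized by a symmetric matrix $B$ whose determinant and trace play the roles of the volume and trace measures; the function name and the example matrices are hypothetical and chosen only for illustration.

```python
import numpy as np

def set_measures(B):
    """Return (det B, tr B), used here as stand-ins for the volume and trace measures."""
    return float(np.linalg.det(B)), float(np.trace(B))

# Non-degenerate ellipsoid in a two-dimensional parameter space.
B_full = np.diag([4.0, 1.0])
# Degenerate ellipsoid: collapsed along one axis, still extended along the other.
B_flat = np.diag([4.0, 0.0])

vol_full, tr_full = set_measures(B_full)
vol_flat, tr_flat = set_measures(B_flat)

# The volume measure vanishes for the flat ellipsoid although it is not a point;
# the trace measure stays positive.
print(vol_full, tr_full)
print(vol_flat, tr_flat)
```

In this toy case the flat ellipsoid still spans a line segment, so a zero volume measure says nothing about whether the set has shrunk to a single point, whereas the trace measure remains nonzero.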

It has frequently been noted that the hyperellipsoidal bounding sets resulting from UOBE algorithms
can be quite "loose" supersets of the exact feasibility sets (polytopes) (e.g., [S], [84]), particularly in "finite"


Please read the abbreviation "p.e." as "persistently exciting" or "persistency of excitation," as appropriate.

However, many simulation studies in the literature (white noise case) have shown the volume of the
ellipsoids to become quite small in the "long term." Further, as we and other researchers have demonstrated,
the empirical convergence and tracking properties of the UOBE estimator are favorable in spite of the few
data used. This is an indication that the presence of the ellipsoid and the optimization procedure centered
on it are quite useful for signal processing, regardless of our present inability to completely understand
its behavior in theory. The results presented above offer further support for "good behavior" of this class
of algorithms by indicating that the ellipsoid measures will converge to some unspecified size in some
unspecified manner. This result has not been clearly understood, and its finding offers some hope that a
proof of convergence (in some sense) for the UOBE algorithms may be found in the white noise case.

The D-H OBE algorithm [24] has been cited above as an instance of UOBE which does exhibit
convergence of the ellipsoid under usual "white noise" excitation conditions. From the UOBE point of view,
the trick employed by Dasgupta and Huang is to use a noncausal scaling sequence which is also pointwise
optimized in a certain sense. In particular $\zeta(n) = (1 - \lambda^*(n))^{-1}$ for each $n$, so that the previous weights
are also modified in a manner which is consistent with the optimal objective of minimizing $\kappa(n)$. This
choice of scaling sequence and minimization criterion admits the clever use of Lyapunov theory to obtain
the convergence result. As we have also indicated above, the checking criterion for the D-H OBE method
is of $O(m)$ complexity per $n$, adding another attractive feature. These theoretical and computational
benefits notwithstanding, in published simulation studies, this method has not been shown to exhibit any
significant advantage in estimation or tracking with respect to the adaptive SM-WRLS methods, for
example, discussed above. As we have also discussed above, the inability to determine the precise meaning
of the optimization criterion leaves open some fundamental questions in the interpretation of the behavior
of the method.

Finally, we note that some work with colored disturbances has been reported. The D-H OBE algorithm
has been extended to the case of an ARMA (SISO) model by Rao et al. in [99]-[101]. In this case the error
is filtered by a linear (MA) filter creating a colored noise sequence, say,

$$\varepsilon'_*(n) = \varepsilon_*(n) + \sum_{i=1}^{r} b_i\,\varepsilon_*(n-i). \tag{71}$$

Rao's approach is to estimate the unobservable sequence $\varepsilon_*(\cdot)$ by the errors $\varepsilon(n-i, \theta(n-i))$, $i = 0, 1, \ldots, r$
(at time $n$), then use the D-H OBE developments. Error bounds on $\varepsilon_*(\cdot)$ are not sufficient to bound $\varepsilon'_*(\cdot)$, so
that the ellipsoid is no longer guaranteed to contain the true parameters. A condition on the $b_i$ parameters
is determined such that this violation does not occur. Not surprisingly, the condition implies a restriction
on the amount of correlation which can be induced by the MA filter.


Norton has proposed the use of inner bounds as a possible remedy for this problem.

Recall that this is not the minimization of the ellipsoid size per se (cf. discussion in Sections J.J and

Related work is found in the paper by Norton [84] in which the ARMAX model is studied. In this work,
the effects of the coloring of the noise upon the bounds are studied both analytically and experimentally.
The results indicate the possibility of non-convex bounds on the true feasible set, $\Omega(\cdot)$, in the colored
noise case. An attempt is made to relate these anomalies to the bias that occurs in conventional WRLS
processing due to the colored noise.

Nayeri et al. [80] have argued that the ellipsoid must remain nontrivial in the colored noise case (i.e.,
$\lim_{n\to\infty}\Omega(n) \neq \{\theta_*\}$) and have conjectured that $\theta_*$ will appear on the boundary of the limiting
ellipsoid when $\varepsilon_*(\cdot)$ is p.e. Some interesting effects of non-p.e. disturbances alluded to above are also
studied in the cited paper.

6.7  Parallel  Hardware  Implementations 

One of the advantages of the QR-WRLS-based UOBE formulation, and the feature which motivated its
development [26], is that it immediately admits solution by contemporary parallel architectures. This is
critical because it reduces the complexity of the optimal algorithm from $O(m^2)$ to $O(m)$, where $m$ is the
number  of  parameters  to  be  estimated.  The  significant  reduction  of  computational  complexity  and  parallel 
hardware  implementation  of  SM  algorithms  improve  their  potential  for  real  time  applications.  Systolic 
architectures  for  both  nonadaptive  [31]  and  adaptive  [87], [88]  versions  of  the  SM-WRLS  algorithm  have 
been  developed  by  Odeh  and  Deller.  The  adaptive  architecture  has  somewhat  more  complex  cells,  but 
the  computational  savings  with  respect  to  sequential  solutions  is  identical.  The  complexity  of  the  parallel 
computation  is  given  by 

$$f_{\text{opt, parallel}} \approx O(3m) + \rho\,O(11m) \ \text{flops per } n \tag{75}$$

if the optimal checking is implemented, where $\rho$, as above, is the fraction of the data accepted by the SM
considerations. If suboptimal checking is employed, the average count is

$$f_{\text{subopt, parallel}} \approx O(m) + \rho'\,O(11m) \ \text{flops per } n, \tag{76}$$

where $\rho'$ likewise indicates the acceptance ratio. When adaptation by back-rotation is added to either
strategy, an additional $b\,O(11m)$ (or $b'\,O(11m)$) flops per $n$ are required on the average, where $b$ and $b'$, as
above, indicate the average number of back-rotations computed per $n$ in the optimal and suboptimal cases.
Note that these tallies represent parallel complexities in the sense that they denote the effective number
of operations per $n$, though many processors can be performing this number of operations simultaneously.
Accordingly, the parallel complexity indicates the time it takes the parallel architecture to process the data
regardless of the total number of operations performed by the individual cells.
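To get a feel for these tallies, the counts in (75) and (76) can be evaluated for specific values. The sketch below is an illustration only: it assumes the order terms may be read literally as $3m$, $m$, and $11m$ flops, which is stronger than what the order notation states, and the function and argument names are hypothetical.

```python
def parallel_flops_per_sample(m, accept_ratio, back_rotations=0.0, optimal_check=True):
    """Rough effective flop count per sample, reading the O(.) terms of (75)/(76) literally.

    m              -- number of parameters
    accept_ratio   -- fraction of data accepted (rho or rho')
    back_rotations -- average number of back-rotations per sample (b or b')
    optimal_check  -- True for optimal checking (3m), False for suboptimal (m)
    """
    check_cost = 3 * m if optimal_check else m
    update_cost = accept_ratio * (11 * m)
    adapt_cost = back_rotations * (11 * m)
    return check_cost + update_cost + adapt_cost

# Example: 10 parameters, 10% of the data accepted, no adaptation.
opt_cost = parallel_flops_per_sample(10, 0.10, optimal_check=True)
sub_cost = parallel_flops_per_sample(10, 0.10, optimal_check=False)
print(opt_cost, sub_cost)
```

Under this literal reading the checking term dominates when the acceptance ratio is small, which is why the choice between optimal and suboptimal checking matters more than the update cost in that regime.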

Unlike the sequential algorithms, scaling may be added to the parallel processors (to implement the
F-H OBE algorithm, for example) at virtually no computational cost, but at the negligible hardware cost
of $m$ multiplication units.

The  parallel  architectures  described  in  papers  cited  above  are  developed  for  real,  scalar  observations, 
but  can  be  used  for  complex  scalar  observations.  The  necessary  modifications  are  concerned  with  the 
basic Givens rotation operations. These are elementary and are found, for example, in [48]. However, the
general  complex  vector  observation  case  of  the  SM-WRLS  (UOBE)  algorithm  is  not  readily  mapped  into 
similar  architectures.  The  generalized  architecture  that  efficiently  implements  this  case  has  not  yet  been 
developed. 

7  Conclusions  and  Further  Issues 

The  emerging  field  of  SM-based  signal  processing  is  receiving  considerable  attention  and  is  becoming 
increasingly  popular  around  the  world.  In  this  paper,  we  have  given  a  general  review  of  SM  theory  and  a 
broad  coverage  of  general  SM  algorithms  and  related  topics.  The  majority  of  this  paper  has  been  concerned 
with  a  class  of  SM  algorithms  for  estimating  the  parameters  of  linear-in-parameter  system  or  signal  models 
in  which  the  error  sequence  is  pointwise  "energy  bounded."  Specifically,  we  have  focused  on  the  case  of 
ellipsoid  algorithms  which  have  been  shown  to  represent  a  blending  of  the  classical  LSE  methods  with  the 
BE  constraints. 

The combined LSE/BE algorithm has been formulated as a UOBE strategy which embraces all reported
algorithms, adaptive and nonadaptive. Within this framework, a flexible strategy based on "back-rotation"
has  been  proposed  to  make  the  UOBE  algorithms  specifically  adaptive.  The  adaptive  strategies  as  well  as 
the  nonadaptive  cases  performed  well  in  simulation  trials. 

In  general,  SM  approaches  are  interesting  because  they  produce  sets  of  feasible  solutions  based  on 
tenable  assumptions  where  no  unique  solution  may  otherwise  exist.  In  the  signal  processing  (LSE)  domain, 
a  unique  solution  exists,  but  the  set  provided  by  UOBE  is  interesting  from  two  points  of  view.  First,  the 
feasible set may complement the unique LSE solution in cases in which the ordinary assumptions about
the model error are tenuous (for example, where the model noise is colored). Secondly, from the feasible
set arises an interesting data selection technique which can lead to significant computational complexity
improvement. UOBE algorithms typically reject 70 - 95% of the incoming data sets because they fail to
refine the existing ellipsoidal feasible set in some sense. This should not be misinterpreted, however, to imply
a 70 - 95% load improvement. In fact, certain constraints must be observed to achieve more than a gain
of about five in complexity improvement with respect to conventional WRLS. If a sequential computing
machine is to be used, then suboptimal checking (for feasible set refinement) must be used. A method
suggested in this paper has been found to perform quite well, and yields $O(m)$ complexity compared with
(at least) $O(m^2/2)$ for the optimal algorithm. This lowered complexity can be preserved with adaptive
strategies which do not require excessive reiteration over the past data sets. Secondly, scaling factors ($\zeta(\cdot)$,
see Section 4.2), including exponential forgetting factors, cannot be used except at $O(m^2/2)$ expense. No
compelling reason for the use of such factors has been observed in simulation studies in the literature. (A
theoretical argument exists for the use of scaling factors in the D-H OBE algorithm [24]. In this case, the
scaling strategy leads to convergence of the ellipsoidal bounding set in a certain sense.) Finally, a parallel
processing version of the UOBE method has been presented with which to achieve $O(m)$ complexity under
virtually  any  condition  of  scaling,  adaptation,  optimal  or  suboptimal  checking.  Real  applications  of  these 
identification  techniques  will  benefit  when  these  relatively  simple  architectures  can  be  dedicated  to  the 
process.  It  is  interesting  to  note  that  the  infrequent  updating  which  results  as  a  consequence  of  the  UOBE 
considerations,  may  lead  to  strategies  for  time-sharing  of  these  parallel  processors. 

The  simulation  results  presented  illustrate  important  points  about  the  various  UOBE  methods  and 
show  that  the  adaptive  algorithms  yield  accurate  estimates  using  very  few  of  the  data  and  quickly  adapt 
to fast variations in the signal dynamics.

Some  of  the  key  theoretical  results  underlying  the  UOBE  class  of  algorithms  appear  in  the  appendices 
of  this  paper.  These  appendices  unify  many  theoretical  results  found  in  the  literature. 

Many interesting open research problems remain in SM-based signal processing. Among them are the
pursuit of different adaptation strategies, refined hardware solutions, and a world of other challenges that
will emerge as these exciting new techniques continue to be applied to practical problems. As computing
power continues to increase, many of the more complex error bounding, and other SM, algorithms will
begin to attract more attention of signal processing engineers. In this sense, the techniques upon which
this paper has focused may ultimately comprise a very small part of the overall impact which SM-based
techniques will have on the signal processing technologies.

A Appendices


These appendices present results which will rigorously support informal arguments made in the main
text. For generality, we will include a scaling sequence (of the form (22) unless otherwise noted) in the
WRLS recursions. "Unscaled" results are obtained by simply dropping subscripts "s" or setting $\zeta(n) = 1$
wherever it occurs. Without loss of generality, however, we shall not explicitly include the downdating
process for adaptation which was developed in the paper. If the solution at time $n-1$ is to be downdated
prior to consideration of $(y(n), x(n))$, then all quantities implicitly or explicitly involving past data will
be modified, and then will enter the developments in precisely the same way as their "un-downdated"
counterparts. In this case, for example, every occurrence of $G_s(n) = x^H(n)C_s^{-1}(n-1)x(n)$ should be
replaced by $G_{d,s}(n) = x^H(n)C_{d,s}^{-1}(n-1)x(n)$.

A.1 Propositions and Corollaries


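Proposition 1 Consider the feasibility set in $\mathbb{C}^{m\times k}$ arising from the BE constraints as in (29). Given
observations on time range $t \in [1, n]$, let $\Theta(n)$ denote the weighted LSE estimate with associated covariance
matrix $C(n)$. The weights used in the estimation are $\lambda_n(t)$ with $\lambda_n(1) > 0$. There exists a hyperellipsoidal
set of parameter vectors, $\Omega(n) \subset \mathbb{C}^{m\times k}$, which contains this feasibility set (and hence $\Theta_*$), and which is given by

$$\Omega(n) = \left\{\Theta \;\middle|\; \mathrm{tr}\left\{[\Theta - \Theta(n)]^H \frac{C(n)}{\kappa(n)} [\Theta - \Theta(n)]\right\} < 1,\ \Theta \in \mathbb{C}^{m\times k}\right\} \tag{77}$$

where

$$\kappa(n) = \mathrm{tr}\{\Theta^H(n)C(n)\Theta(n)\} + \sum_{t=1}^{n}\lambda_n(t)\gamma(t)\left(1 - \gamma^{-1}(t)\,\|y(t)\|^2\right). \tag{78}$$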

Remark: When $k = 1$ the result of Proposition 1 reduces to a generalization of the MISO case result
found in many papers in the literature. When $k > 1$, a hyperellipsoidal bounding set is also associated
with each scalar component of the output vector as we show in the following corollary.

Corollary 1 Under the conditions of Proposition 1, feasible parameter vectors associated with output $y_i$
(column $i$ of $\Theta$), say $\theta_i$, are confined to a hyperellipsoidal membership set, say $\Omega_i(n)$, which is centered
on its current weighted LSE estimate,

$$\Omega_i(n) = \left\{\theta_i \;\middle|\; [\theta_i - \theta_i(n)]^H \frac{C(n)}{\kappa(n)} [\theta_i - \theta_i(n)] < 1\right\}. \tag{79}$$

Remark: This means simply that there is a hyperellipsoidal domain in the parameter subspace which
contains all possible parameter vectors and which is centered on the WRLS estimate. Note that the
ellipsoid associated with each $y_i$, $i = 1, 2, \ldots, k$, is identical to all others except for its center.
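The membership conditions (77) and (79) are simple quadratic-form tests. The following minimal sketch assumes that the estimate, the matrix $C(n)$, and the scalar $\kappa(n)$ are already available as NumPy arrays and a float; the function names and the numerical values are hypothetical and serve only to illustrate the tests.

```python
import numpy as np

def in_matrix_ellipsoid(Theta, Theta_hat, C, kappa):
    """Test (77): tr{[Theta - Theta_hat]^H C [Theta - Theta_hat]} < kappa."""
    D = Theta - Theta_hat
    return float(np.real(np.trace(D.conj().T @ C @ D))) < kappa

def in_column_ellipsoid(theta_i, theta_hat_i, C, kappa):
    """Test (79): [theta_i - theta_hat_i]^H C [theta_i - theta_hat_i] < kappa."""
    d = theta_i - theta_hat_i
    return float(np.real(d.conj() @ C @ d)) < kappa

# Illustrative values: m = 2 parameters per output, k = 2 outputs.
C = np.diag([2.0, 1.0])
kappa = 1.0
Theta_hat = np.zeros((2, 2))
Theta_in = np.array([[0.3, 0.2], [0.1, 0.4]])
Theta_out = np.array([[1.0, 0.0], [0.0, 0.0]])

inside = in_matrix_ellipsoid(Theta_in, Theta_hat, C, kappa)
outside = in_matrix_ellipsoid(Theta_out, Theta_hat, C, kappa)
columns_inside = [
    in_column_ellipsoid(Theta_in[:, i], Theta_hat[:, i], C, kappa) for i in range(2)
]
print(inside, outside, columns_inside)
```

Because the trace in (77) is the sum of the per-column quadratic forms, any matrix that passes the first test has every column passing the second, which is the content of the corollary.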




Proposition 2 If it exists, the weight $\lambda^*(n)$ which minimizes

1. the volume measure $\mu_v(n)$ is the unique positive root of the quadratic equation

$$F_v(\lambda) = a_2\lambda^2 + a_1\lambda + a_0 = 0 \tag{80}$$

where

$$a_2 = (mk - 1)\,\gamma(n)\,G_s^2(n),$$

$$a_1 = \left\{(2mk - 1) + \gamma^{-1}(n)\,\|\varepsilon(n,\Theta(n-1))\|^2 - \kappa_s(n-1)\,\gamma^{-1}(n)\,G_s(n)\right\}\gamma(n)\,G_s(n),$$

$$a_0 = mk\left[\gamma(n) - \|\varepsilon(n,\Theta(n-1))\|^2\right] - \kappa_s(n-1)\,G_s(n);$$

2. the trace measure $\mu_t(n)$ is the unique positive root of the cubic equation

$$F_t(\lambda) = b_3\lambda^3 + b_2\lambda^2 + b_1\lambda + b_0 = 0 \tag{81}$$

with

$$b_3 = \gamma(n)\,G_s^2(n)\left[G_s(n) - I_s^{-1}(n-1)H_s(n)\right],$$

$$b_2 = 3\gamma(n)\,G_s(n)\left[G_s(n) - I_s^{-1}(n-1)H_s(n)\right],$$

$$b_1 = -H_s(n)G_s(n)I_s^{-1}(n-1)\kappa_s(n-1) - 2H_s(n)I_s^{-1}(n-1)\left[\gamma(n) - \|\varepsilon(n,\Theta(n-1))\|^2\right] - G_s(n)\,\|\varepsilon(n,\Theta(n-1))\|^2 + 3\gamma(n)G_s(n),$$

$$b_0 = \gamma(n) - \|\varepsilon(n,\Theta(n-1))\|^2 - H_s(n)I_s^{-1}(n-1)\kappa_s(n-1),$$

where $G_s(n) \overset{\text{def}}{=} x^H(n)C_s^{-1}(n-1)x(n)$, $H_s(n) \overset{\text{def}}{=} x^H(n)C_s^{-2}(n-1)x(n)$, and $I_s(n) \overset{\text{def}}{=} \mathrm{tr}\,C_s^{-1}(n)$.

Remark: Many of the inherent scale factors in the coefficients above cancel, but for practical implementation
it is more useful to express the coefficients with the scaled quantities included. By cancelling the scale
factors, however, the following can immediately be observed for either optimization criterion: If $\lambda_s^*(n)$
denotes the optimal weight (or, in fact, any root of the polynomial) at time $n$ with scaling, while $\lambda^*(n)$
denotes the weight resulting if no scaling takes place, then $\lambda_s^*(n) = \lambda^*(n)\zeta(n)$.
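The volume case of Proposition 2 can be exercised numerically. The sketch below is an illustration under stated assumptions: it takes the coefficients of (80) as written above, restricts attention to $mk > 1$ and $G_s(n) > 0$ so that $a_2 > 0$, and evaluates the volume ratio $r^{mk}/h$ used in the proof; the function names and the sample values are hypothetical.

```python
import math

def volume_quadratic_coeffs(mk, gamma, G, eps2, kappa_prev):
    """Coefficients (a2, a1, a0) of the volume quadratic (80).

    eps2 is the squared norm of the prediction error; kappa_prev is kappa_s(n-1).
    """
    a2 = (mk - 1) * gamma * G**2
    a1 = ((2 * mk - 1) * gamma + eps2 - kappa_prev * G) * G
    a0 = mk * (gamma - eps2) - kappa_prev * G
    return a2, a1, a0

def optimal_volume_weight(mk, gamma, G, eps2, kappa_prev):
    """Positive root of (80), or None when a0 >= 0 (no positive root)."""
    if mk < 2 or G <= 0:
        raise ValueError("this sketch assumes mk > 1 and G > 0, so that a2 > 0")
    a2, a1, a0 = volume_quadratic_coeffs(mk, gamma, G, eps2, kappa_prev)
    if a0 >= 0:
        return None
    disc = a1 * a1 - 4.0 * a2 * a0  # positive because a2 > 0 and a0 < 0
    return (-a1 + math.sqrt(disc)) / (2.0 * a2)

def volume_ratio(lam, mk, gamma, G, eps2, kappa_prev):
    """Volume ratio r^{mk} / h, with r = kappa(n) / kappa_s(n-1) and h = 1 + lam * G."""
    h = 1.0 + lam * G
    r = 1.0 + lam * gamma / kappa_prev - lam * eps2 / (kappa_prev * h)
    return r**mk / h

params = dict(mk=2, gamma=1.0, G=0.5, eps2=4.0, kappa_prev=10.0)
lam_star = optimal_volume_weight(**params)
nu_star = volume_ratio(lam_star, **params)
print(lam_star, nu_star)
```

For these sample values the ratio at the computed weight is below one and below its values at nearby weights, in line with the statement that the positive root minimizes the volume measure.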

Corollary 2 There is an inherent hyperellipsoid associated with the D-H OBE algorithm whose volume
at time $n$ would be minimized by the positive root of the quadratic $F'_v(\lambda) = a'_2\lambda^2 + a'_1\lambda + a'_0$ where
$a'_2 = a_2 + a_0 - a_1$, $a'_1 = a_1 - 2a_0$, and $a'_0 = a_0$, where $a_i$, $i = 0, 1, 2$ are defined as in Proposition 2.


Remarks:

1. Interestingly, the quadratic in Corollary 2 can be obtained by using the scale factor $\zeta(n) = (1 - \lambda^*(n))^{-1}$
in the results of Proposition 2. That this should be true is not obvious because of the
nonlinearities in $\lambda$ which are created by these scale factors.

2. A similar result likely obtains for the trace case.

3. The utility of this result remains an open question because to use weights which are optimal in this
sense does not necessarily admit the convergence results obtained by the Dasgupta-Huang analysis.
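One way to read Remark 1 is as a change of variable: replacing the weight in the quadratic of Proposition 2 by $\lambda/(1-\lambda)$ and clearing the factor $(1-\lambda)^2$ gives $(a_2 - a_1 + a_0)\lambda^2 + (a_1 - 2a_0)\lambda + a_0$, which matches the primed coefficients of Corollary 2. The sketch below checks this numerically; the interpretation of the substitution, the function names, and the sample coefficients are assumptions made for illustration.

```python
import math

def dh_quadratic_coeffs(a2, a1, a0):
    """Primed coefficients (a2', a1', a0') of Corollary 2."""
    return a2 + a0 - a1, a1 - 2.0 * a0, a0

def eval_quadratic(coeffs, lam):
    c2, c1, c0 = coeffs
    return c2 * lam**2 + c1 * lam + c0

# Sample coefficients with a2 > 0 and a0 < 0, so one positive root exists.
a = (0.25, 1.0, -11.0)
mu = (-a[1] + math.sqrt(a[1] ** 2 - 4.0 * a[0] * a[2])) / (2.0 * a[0])

# Map the positive root mu of the original quadratic through mu = lam / (1 - lam).
lam = mu / (1.0 + mu)
residual = eval_quadratic(dh_quadratic_coeffs(*a), lam)
print(mu, lam, residual)
```

The mapped value lies in $(0, 1)$ and is a root of the primed quadratic up to rounding, which is consistent with the coefficient relations stated in the corollary.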


Corollary 3 Consider the UOBE algorithm with simple scale factors and volume optimization. If an
optimal weight exists at time $n$, then its use will certainly diminish the volume, $\mu_v(n) < \mu_v(n-1)$.

Remark:  A  similar  result  can  be  obtained  for  the  trace  measure  [80]. 


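Proposition 3 $\kappa(n, \lambda_n(n))$ has the following properties: 1. On the interval $\lambda_n(n) \in (0, \infty)$, $\kappa(n, \lambda_n(n))$
is either monotonically increasing or it has a single minimum. 2. $\kappa(n, \lambda_n(n))$ has a minimum on
$\lambda_n(n) \in (0, \infty)$ if and only if

$$\|\varepsilon(n,\Theta(n-1))\|^2 > \gamma(n). \tag{82}$$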


Proposition 4 Consider the UOBE algorithm with causally scaled weights as in (22). Then, if suboptimal
check (82) holds, a positive optimal weight exists for either the volume or trace algorithm.

Remark: The D-H OBE [24] algorithm uses a similar test (see (71)), derived by very different arguments.
See the discussion in Section 6.5.2 for interesting similarities between Proposition 4 and D-H OBE.

A.2 Lemmas

Lemma 1 Condition (29) implies

$$\sum_{t=1}^{n}\lambda_n(t)\,\|\varepsilon_*(t)\|^2 \le \sum_{t=1}^{n}\lambda_n(t)\gamma(t) \tag{83}$$

for any non-negative (real) sequence $\lambda_n(\cdot)$. The equality can be removed for $n \ge t_0$, where $t_0$ is the minimum
$t$ for which $\lambda_n(t) \neq 0$.


Lemma 2 The scalar sequence $\kappa(\cdot)$ of (78) can be computed recursively in two ways:

1. In the context of MIL-WRLS:

$$\kappa(n) = \kappa_s(n-1) + \lambda_n(n)\gamma(n) - \frac{\lambda_n(n)\,\|\varepsilon(n,\Theta(n-1))\|^2}{1 + \lambda_n(n)G_s(n)} \tag{84}$$

with $\kappa_s(n-1) \overset{\text{def}}{=} \kappa(n-1)/\zeta(n-1)$ and $\kappa_s(0) \overset{\text{def}}{=} 0$.

2. In the context of QR-WRLS: Let $T(n)\Theta(n) = D_1(n)$ represent the triangular system of equations to be
solved at time $n$ and let $\delta_1(n) = \mathrm{tr}\{D_1^H(n)D_1(n)\}$. Then,

$$\kappa(n) = \delta_1(n) + \tilde{\kappa}(n) \tag{85}$$

with

$$\tilde{\kappa}(n) = \tilde{\kappa}_s(n-1) + \lambda_n(n)\gamma(n)\left(1 - \gamma^{-1}(n)\,\|y(n)\|^2\right) \tag{86}$$

where $\tilde{\kappa}_s(n-1)$ is defined as above.


A.3 Proofs

Proof of Lemma 1: That the equality holds for $n < t_0$ is obvious. At $t_0$,

$$\lambda_n(t_0)\,\|\varepsilon_*(t_0)\|^2 < \lambda_n(t_0)\gamma(t_0). \tag{87}$$

Since $\lambda_n(t) = 0$ for $t < t_0$, (87) may be written

$$\sum_{t=1}^{t_0}\lambda_n(t)\,\|\varepsilon_*(t)\|^2 < \sum_{t=1}^{t_0}\lambda_n(t)\gamma(t). \tag{88}$$

Sequentially add inequalities $\lambda_n(t)\,\|\varepsilon_*(t)\|^2 \le \lambda_n(t)\gamma(t)$ for $t = t_0 + 1, \ldots, n$, noting that the inequality
between the sums is preserved.

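Proofs of Proposition 1 and Corollary 1: Upon writing $\|\varepsilon_*(t)\|^2$ as $\mathrm{tr}\{\varepsilon_*(t)\varepsilon_*^H(t)\}$, it follows
immediately from Lemma 1 that

$$\sum_{t=1}^{n}\lambda_n(t)\,\mathrm{tr}\left\{[y(t) - \Theta_*^H x(t)][y(t) - \Theta_*^H x(t)]^H\right\} < \sum_{t=1}^{n}\lambda_n(t)\gamma(t). \tag{89}$$

This constrains the possible parameter matrices to the set

$$\Omega(n) = \left\{\Theta \;\middle|\; \sum_{t=1}^{n}\lambda_n(t)\,\mathrm{tr}\left\{[y(t) - \Theta^H x(t)][y(t) - \Theta^H x(t)]^H\right\} < \sum_{t=1}^{n}\lambda_n(t)\gamma(t)\right\}. \tag{90}$$

Expanding the trace term,

$$\Omega(n) = \left\{\Theta \;\middle|\; \sum_{t=1}^{n}\lambda_n(t)\,\mathrm{tr}\left[y(t)y^H(t) - \Theta^H x(t)y^H(t) - y(t)x^H(t)\Theta + \Theta^H x(t)x^H(t)\Theta\right] < \sum_{t=1}^{n}\lambda_n(t)\gamma(t)\right\}. \tag{91}$$

Moving the summation across terms,

$$\Omega(n) = \left\{\Theta \;\middle|\; \mathrm{tr}\left\{C_y(n) - \Theta^H C_{xy}(n) - C_{xy}^H(n)\Theta + \Theta^H C(n)\Theta\right\} < \sum_{t=1}^{n}\lambda_n(t)\gamma(t)\right\} \tag{92}$$

where definitions of $C_{xy}(\cdot)$ and $C_y(\cdot)$ are inherent. Since $C_{xy}(n) = C(n)\Theta(n)$,

$$C_{xy}^H(n) = \Theta^H(n)C^H(n) = \Theta^H(n)C(n). \tag{93}$$

This substitution in (92) and some simple manipulation yields

$$\Omega(n) = \left\{\Theta \;\middle|\; \mathrm{tr}\left\{\Theta^H C(n)\Theta - \Theta^H C(n)\Theta(n) - \Theta^H(n)C(n)\Theta\right\} < \sum_{t=1}^{n}\lambda_n(t)\gamma(t) - \mathrm{tr}\{C_y(n)\}\right\}. \tag{94}$$

Completing the square on the left side yields

$$\Omega(n) = \left\{\Theta \;\middle|\; \mathrm{tr}\left\{\Theta^H C(n)\Theta - \Theta^H C(n)\Theta(n) - \Theta^H(n)C(n)\Theta + \Theta^H(n)C(n)\Theta(n)\right\} < \sum_{t=1}^{n}\lambda_n(t)\gamma(t) - \mathrm{tr}\{C_y(n)\} + \mathrm{tr}\{\Theta^H(n)C(n)\Theta(n)\} = \kappa(n)\right\}. \tag{95}$$

The definition of $\kappa(n)$ in (95) is seen to be equivalent to that given in (78) by noting that
$\mathrm{tr}\{C_y(n)\} = \sum_{t=1}^{n}\lambda_n(t)\,\|y(t)\|^2$. It follows that the set is described by

$$\Omega(n) = \left\{\Theta \;\middle|\; \mathrm{tr}\left\{[\Theta - \Theta(n)]^H C(n)[\Theta - \Theta(n)]\right\} < \kappa(n)\right\}. \tag{96}$$

Since $C(n)$ is positive definite almost surely, the left side of this inequality must be a positive number.
Therefore $\kappa(n) > 0$. Dividing both sides by $\kappa(n)$ yields (77). □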

To prove Corollary 1, it is convenient to write

$$\mathrm{tr}\left\{[\Theta - \Theta(n)]^H C(n)[\Theta - \Theta(n)]\right\} = \sum_{j=1}^{k} c_j \tag{97}$$

where $c_j$ indicates the $j$th diagonal element of $[\Theta - \Theta(n)]^H C(n)[\Theta - \Theta(n)]$. Now it is clear that

$$c_i = [\theta_i - \theta_i(n)]^H C(n)[\theta_i - \theta_i(n)] \tag{98}$$

for any $i$, where $\theta_i$ and $\theta_i(n)$ are the $i$th columns of $\Theta$ and $\Theta(n)$, respectively. It is also true that all the
$c_j$'s are positive since $C(n)$ is a positive definite matrix. Therefore,

$$c_i \le \sum_{j=1}^{k} c_j < \kappa(n) \quad \text{for any } i \in [1, k]. \tag{99}$$

Dividing through by $\kappa(n)$ yields inequality (79). □

Proof of Lemma 2: Case 1. Write $\lambda$ for $\lambda_n(n)$ and $\varepsilon$ for $\varepsilon(n,\Theta(n-1))$. Inserting the right side of (25) into (78) for $\Theta(n)$ gives

$$\kappa(n) = \mathrm{tr}\left\{\left[\Theta(n-1) + \lambda C^{-1}(n)x(n)\varepsilon^H\right]^H C(n)\left[\Theta(n-1) + \lambda C^{-1}(n)x(n)\varepsilon^H\right]\right\} + \sum_{t=1}^{n}\lambda_n(t)\gamma(t)\left(1 - \gamma^{-1}(t)\,\|y(t)\|^2\right) \tag{100}$$

$$= \mathrm{tr}\Big\{\Theta^H(n-1)\left[C_s(n-1) + \lambda x(n)x^H(n)\right]\Theta(n-1) + \lambda\varepsilon x^H(n)\Theta(n-1) + \lambda\Theta^H(n-1)x(n)\varepsilon^H$$
$$\qquad + \lambda^2\varepsilon x^H(n)\left[C_s^{-1}(n-1) - \lambda\frac{C_s^{-1}(n-1)x(n)x^H(n)C_s^{-1}(n-1)}{1 + \lambda G_s(n)}\right]x(n)\varepsilon^H\Big\} + \sum_{t=1}^{n}\lambda_n(t)\gamma(t)\left(1 - \gamma^{-1}(t)\,\|y(t)\|^2\right)$$

$$= \kappa_s(n-1) + \lambda\gamma(n) - \lambda\,\|y(n)\|^2 + \lambda\,\mathrm{tr}\left\{[y(n) - \varepsilon][y(n) - \varepsilon]^H + \varepsilon[y(n) - \varepsilon]^H + [y(n) - \varepsilon]\varepsilon^H + \frac{\lambda\,\varepsilon\varepsilon^H G_s(n)}{1 + \lambda G_s(n)}\right\}$$

$$= \kappa_s(n-1) + \lambda\gamma(n) - \lambda\,\mathrm{tr}\{\varepsilon\varepsilon^H\}\left[1 - \frac{\lambda G_s(n)}{1 + \lambda G_s(n)}\right].$$

Equation (84) follows immediately upon recognizing the trace term to be $\|\varepsilon(n,\Theta(n-1))\|^2$.


Case 2. Because $T(n)$ is obtained by an orthonormal transformation of $X(n)$ (see (19)),

$$\mathrm{tr}\{\Theta^H(n)C(n)\Theta(n)\} = \mathrm{tr}\{\Theta^H(n)T^H(n)T(n)\Theta(n)\} = \mathrm{tr}\{D_1^H(n)D_1(n)\}.$$

This is the first term in the basic expression for $\kappa(n)$ in (78). The second term can be written as

$$\frac{\sum_{t=1}^{n-1}\lambda_{n-1}(t)\gamma(t)\left(1 - \gamma^{-1}(t)\,\|y(t)\|^2\right)}{\zeta(n-1)} + \lambda_n(n)\gamma(n)\left(1 - \gamma^{-1}(n)\,\|y(n)\|^2\right).$$

The desired recursion follows immediately.
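The MIL-WRLS recursion (84) can be checked against the direct expression (78) on synthetic data. The sketch below is a minimal illustration under simplifying assumptions that are not taken from the paper: real scalar observations ($k = 1$), no scaling ($\zeta(n) = 1$), and a small regularized starting matrix $C(0) = \delta I$ with $\theta(0) = 0$ so that the inverse exists from the first sample. All names and data are hypothetical.

```python
import numpy as np

def kappa_recursive_and_direct(X, y, lam, gamma, delta=1e-3):
    """Return kappa(n) computed by recursion (84) and by the direct form (78)."""
    n, m = X.shape
    C = delta * np.eye(m)      # assumed regularized start, so C is invertible
    theta = np.zeros(m)        # theta(0) = 0, hence kappa(0) = 0
    kappa = 0.0
    for t in range(n):
        x = X[t]
        G = x @ np.linalg.solve(C, x)       # G(n) = x^T C^{-1}(n-1) x
        eps = y[t] - theta @ x              # prediction error using theta(n-1)
        kappa += lam[t] * gamma[t] - lam[t] * eps**2 / (1.0 + lam[t] * G)
        C = C + lam[t] * np.outer(x, x)     # covariance update
        theta = theta + lam[t] * eps * np.linalg.solve(C, x)  # WRLS estimate update
    direct = theta @ C @ theta + np.sum(lam * (gamma - y**2))
    return kappa, direct

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.uniform(-1.0, 1.0, 50)
lam = rng.uniform(0.0, 1.0, 50)
lam[::4] = 0.0                 # some data sets are rejected (zero weight)
gamma = np.full(50, 0.01)

kappa_rec, kappa_dir = kappa_recursive_and_direct(X, y, lam, gamma)
print(kappa_rec, kappa_dir)
```

The two values agree to rounding error, and samples with zero weight leave the recursion unchanged, which is how rejected data enter the bookkeeping at no cost.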


Proof of Proposition 2: For simplicity, let us denote $\lambda_n(n)$ by $\lambda$ throughout the proof.


Volume case. Define

$$B(n) = \kappa(n)C^{-1}(n). \tag{106}$$

We wish to minimize $\mu_v(n) = \det B(n)$, and it will be convenient to do so by minimizing the ratio

$$\nu_v(n) \overset{\text{def}}{=} \frac{\mu_v(n)}{\mu_v(n-1)} = \frac{\det B(n)}{\det B(n-1)} = \det\left\{B^{-1}(n-1)B(n)\right\}. \tag{107}$$

From (24) and (106),

$$\frac{B(n)}{\kappa(n)} = \frac{B(n-1)}{\kappa_s(n-1)} - \lambda\frac{B(n-1)x(n)x^H(n)B(n-1)}{\kappa_s^2(n-1)\left[1 + \lambda G_s(n)\right]}. \tag{108}$$

Defining $h(n) \overset{\text{def}}{=} 1 + \lambda G_s(n)$ and $r(n) \overset{\text{def}}{=} \kappa(n)/\kappa_s(n-1)$ yields

$$B(n) = B(n-1)\,r(n)\left\{I - \frac{\lambda x(n)\left[B(n-1)x(n)\right]^H}{\kappa_s(n-1)h(n)}\right\}. \tag{109}$$

Therefore,

$$\nu_v(n) = \det\left\{B^{-1}(n-1)B(n)\right\} = \det\left\{r(n)I - \frac{r(n)\lambda x(n)}{\kappa_s(n-1)h(n)}\left[B(n-1)x(n)\right]^H\right\}. \tag{110}$$

Using the matrix identity [37] (for the complex case)

$$\det(cI + zw^H) = c^{mk-1}(c + w^H z) \tag{111}$$

where $w$ and $z$ are complex vectors and $c$ is a real number, we obtain

$$\nu_v(n) = r^{mk-1}(n)\left\{r(n) - \frac{r(n)\lambda x^H(n)}{\kappa_s(n-1)h(n)}B(n-1)x(n)\right\}. \tag{112}$$

Recall from Section 6.1 that B(n - 1) may be considered either the volume of the scaled or unscaled ellipsoid at time

This can be written as

$$\nu_v(n) = r^{mk}(n)\left\{1 - \frac{\lambda G_s(n)}{h(n)}\right\} = \frac{r^{mk}(n)}{h(n)}. \tag{113}$$

Therefore, to minimize $\nu_v(n)$ with respect to $\lambda$, (113) is differentiated and the result is set to zero,

$$\frac{\partial\nu_v(n)}{\partial\lambda} = \frac{mk\,r^{mk-1}(n)}{h(n)}\frac{\partial r(n)}{\partial\lambda} - \frac{r^{mk}(n)G_s(n)}{h^2(n)} = 0. \tag{114}$$

Since $r(n) > 0$ (see proof of Proposition 1),

$$\frac{h^2(n)}{r^{mk-1}(n)}\frac{\partial\nu_v(n)}{\partial\lambda} = mk\,h(n)\frac{\partial r(n)}{\partial\lambda} - r(n)G_s(n). \tag{115}$$

Now using Lemma 2 we can write

$$r(n) = \frac{\kappa(n)}{\kappa_s(n-1)} = 1 + \frac{\lambda\gamma(n)}{\kappa_s(n-1)} - \frac{\lambda\,\|\varepsilon(n,\Theta(n-1))\|^2}{\kappa_s(n-1)h(n)}. \tag{116}$$

Differentiating this result with respect to $\lambda$ yields

$$\frac{\partial r(n)}{\partial\lambda} = \frac{1}{\kappa_s(n-1)}\left[\gamma(n) - \frac{\|\varepsilon(n,\Theta(n-1))\|^2}{h^2(n)}\right]. \tag{117}$$

Putting this result in (115) and replacing $r(n)$ with the right side of (116) yields

$$\frac{h^2(n)}{r^{mk-1}(n)}\frac{\partial\nu_v(n)}{\partial\lambda} = \frac{mk\,h(n)}{\kappa_s(n-1)}\left[\gamma(n) - \frac{\|\varepsilon(n,\Theta(n-1))\|^2}{h^2(n)}\right] - \left[1 + \frac{\lambda\gamma(n)}{\kappa_s(n-1)} - \frac{\lambda\,\|\varepsilon(n,\Theta(n-1))\|^2}{\kappa_s(n-1)h(n)}\right]G_s(n). \tag{118}$$

After some algebra,

$$\frac{\kappa_s(n-1)h^3(n)}{r^{mk-1}(n)}\frac{\partial\nu_v(n)}{\partial\lambda} = mk\left(\gamma(n)h^2(n) - \|\varepsilon(n,\Theta(n-1))\|^2\right) - \left[\kappa_s(n-1)h(n) + \lambda\gamma(n)h(n) - \lambda\,\|\varepsilon(n,\Theta(n-1))\|^2\right]G_s(n). \tag{119}$$

When $h(n)$ is replaced by $(1 + \lambda G_s(n))$ on the right, the following result is obtained

$$\frac{\kappa_s(n-1)h^3(n)}{r^{mk-1}(n)}\frac{\partial\nu_v(n)}{\partial\lambda} = F_v(\lambda) \tag{120}$$

where $F_v(\lambda)$ is exactly the quadratic of (80). Since the factor in front of the derivative on the left is positive
for any positive $\lambda$, a positive root of $F_v(\lambda)$ corresponds to $\partial\nu_v(n)/\partial\lambda = 0$.

It is noted that the discriminant of the quadratic is always positive so that the roots are always real.
Moreover, when $a_0 > 0$ it is found that $a_1 > 0$ as well. Since $a_2$ is always positive, this implies that the
roots are both negative, since no positive $\lambda$ satisfies (80) in this case. On the other hand, when $a_0 < 0$,
this immediately implies that the roots have opposite signs. Thus exactly one positive root is found. In
the proof of Corollary 3 below, this root will be demonstrated to minimize the volume measure.


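Trace case. As above $r(n) \overset{\text{def}}{=} \kappa(n)/\kappa_s(n-1)$ and $h(n) \overset{\text{def}}{=} (1 + \lambda G_s(n))$. Note that $\mu_t(n) = \mathrm{tr}\{B(n)\}$
and also define

$$\nu_t(n) \overset{\text{def}}{=} \frac{\mu_t(n)}{\mu_t(n-1)} = \frac{\mathrm{tr}\{B(n)\}}{\mathrm{tr}\{B(n-1)\}}. \tag{121}$$

Beginning with (109) it is easy to show that

$$\mathrm{tr}\{B(n)\} = r(n)\left[\mathrm{tr}\{B(n-1)\} - \frac{\lambda}{\kappa_s(n-1)h(n)}x^H(n)B^2(n-1)x(n)\right] \tag{122}$$

so that

$$\nu_t(n) = r(n) - \frac{r(n)\lambda}{\mathrm{tr}\{B(n-1)\}\,\kappa_s(n-1)h(n)}x^H(n)B^2(n-1)x(n) = r(n) - \frac{r(n)\lambda}{\mathrm{tr}\{C_s^{-1}(n-1)\}\,h(n)}x^H(n)C_s^{-2}(n-1)x(n). \tag{123}$$

Letting $H_s(n) \overset{\text{def}}{=} x^H(n)C_s^{-2}(n-1)x(n)$ and $I_s(n-1) \overset{\text{def}}{=} \mathrm{tr}\{C_s^{-1}(n-1)\}$,

$$\nu_t(n) = r(n) - \frac{r(n)\lambda}{I_s(n-1)h(n)}H_s(n) = r(n)\left(1 - \frac{\lambda}{I_s(n-1)h(n)}H_s(n)\right). \tag{124}$$

Differentiating with respect to $\lambda$ and setting the result to zero yields

$$\frac{\partial\nu_t(n)}{\partial\lambda} = \frac{\partial r(n)}{\partial\lambda}\left(1 - \frac{\lambda}{I_s(n-1)h(n)}H_s(n)\right) - \frac{r(n)}{I_s(n-1)h(n)}\left(1 - \frac{G_s(n)\lambda}{h(n)}\right)H_s(n) = 0. \tag{125}$$

Now using (116) and (117) in this result yields

$$\frac{\partial\nu_t(n)}{\partial\lambda} = \frac{1}{\kappa_s(n-1)}\left[\gamma(n) - \frac{\|\varepsilon(n,\Theta(n-1))\|^2}{h^2(n)}\right]\left(1 - \frac{\lambda}{I_s(n-1)h(n)}H_s(n)\right)$$
$$\qquad - \left[1 + \frac{\lambda\gamma(n)}{\kappa_s(n-1)} - \frac{\lambda\,\|\varepsilon(n,\Theta(n-1))\|^2}{\kappa_s(n-1)h(n)}\right]\frac{1}{I_s(n-1)h(n)}\left(1 - \frac{G_s(n)\lambda}{h(n)}\right)H_s(n) = 0. \tag{126}$$

After algebraic manipulation this becomes

$$\frac{\partial\nu_t(n)}{\partial\lambda} = \frac{N(\lambda)}{D(\lambda)} \tag{127}$$

where $N(\lambda)$ is precisely the cubic equation $F_t(\lambda)$ described in the proposition, and $D(\lambda) = \kappa_s(n-1)h^3(n)$.
Since $D(\lambda) > 0$ for all $\lambda > 0$, it is sufficient to seek the positive root(s) of the numerator.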

It is straightforward to show that coefficients $b_2$ and $b_3$ are always positive. When $b_0 > 0$ then $b_1 > 0$, so in this case there can be no positive solution to $F_t(\lambda) = 0$. When $b_0 < 0$, we claim that there is exactly one positive real root. The quantity $(-b_0/b_3)$, which is real and positive, is the product of the roots, so there must be at least one real positive root. The remaining two are a complex conjugate pair, or are both negative or both positive. Now the quantity $(-b_2/b_3)$ is the sum of the roots and is negative. This guarantees that the remaining two roots cannot both be positive. Therefore, the remaining two are either complex or negative, and the claim is verified. □
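The root-counting argument rests on the standard relations between the coefficients of a cubic and its roots. As a brief worked restatement (the root labels $\lambda_1, \lambda_2, \lambda_3$ are introduced here only for illustration), write

$$F_t(\lambda) = b_3\lambda^3 + b_2\lambda^2 + b_1\lambda + b_0 = b_3(\lambda - \lambda_1)(\lambda - \lambda_2)(\lambda - \lambda_3),$$

so that

$$\lambda_1 + \lambda_2 + \lambda_3 = -\frac{b_2}{b_3} < 0, \qquad \lambda_1\lambda_2\lambda_3 = -\frac{b_0}{b_3} > 0 \quad (b_0 < 0).$$

If $\lambda_2, \lambda_3$ form a complex conjugate pair, then $\lambda_2\lambda_3 = |\lambda_2|^2 > 0$, which forces $\lambda_1 > 0$. If all three roots are real, a positive product allows either one or three positive roots, and three positive roots would contradict the negative sum. In both cases exactly one root is real and positive.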
Sketch of Proof of Corollary 2: The procedure parallels the steps used to prove Proposition 2 for the volume case [68]. □

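Proof of Corollary 3: Again we use $\lambda$ to indicate $\lambda_n(n)$, and $\lambda^*$ for $\lambda^*(n)$. Since $n$ is fixed, we write the volume ratio of (107) to show its explicit dependence upon $\lambda$, $\nu_v(\lambda)$, and suppress the dependence upon $n$. From (115),

$$\frac{\partial\nu_v(\lambda)}{\partial\lambda} = Q(\lambda)R(\lambda) \tag{128}$$

where, for $n$ fixed, we make the definitions

$$Q(\lambda) \stackrel{\rm def}{=} \frac{r^{mk-1}(n)}{h^{2}(n)}, \qquad R(\lambda) \stackrel{\rm def}{=} mk\,h(n)\frac{\partial r(n)}{\partial\lambda} - r(n)G_s(n). \tag{129}$$

For future reference, also notice that

$$\kappa_s(n-1)h(n)R(\lambda) = F_v(\lambda) \tag{130}$$

where $F_v(\lambda)$ is the volume quadratic of Proposition 2. This becomes evident upon comparing (114) and (120). Consequently,

$$\frac{\partial^2\nu_v(\lambda)}{\partial\lambda^2} = Q(\lambda)\frac{\partial R(\lambda)}{\partial\lambda} + R(\lambda)\frac{\partial Q(\lambda)}{\partial\lambda}. \tag{131}$$

It is easy to demonstrate that $Q(\lambda)$ is positive, and that its derivative is bounded, for $\lambda \in [0,\infty)$. Now with the aid of (117) we can write

$$\frac{\partial R(\lambda)}{\partial\lambda} = (mk-1)G_s(n)\frac{\partial r(n)}{\partial\lambda} + 2mk\,\frac{\|\varepsilon(n,\theta(n-1))\|^2\, G_s(n)}{\kappa_s(n-1)h^2(n)}. \tag{132}$$

Because of (130) it is clear that $R(\lambda^*) = 0$. Reference to the definition of $R(\lambda)$ in (129), therefore, immediately shows that

$$\left.\frac{\partial r(n)}{\partial\lambda}\right|_{\lambda=\lambda^*} = \left.\frac{r(n)G_s(n)}{mk\,h(n)}\right|_{\lambda=\lambda^*} > 0.$$

Consequently,

$$\left.\frac{\partial R(\lambda)}{\partial\lambda}\right|_{\lambda=\lambda^*} > 0.$$

It follows immediately that

$$\left.\frac{\partial^2\nu_v(\lambda)}{\partial\lambda^2}\right|_{\lambda=\lambda^*} = Q(\lambda^*)\left.\frac{\partial R(\lambda)}{\partial\lambda}\right|_{\lambda=\lambda^*} > 0$$

so that $\lambda^*$ corresponds to a minimum of $\nu_v(\lambda)$ with respect to $\lambda$ (see Fig. 19). Further, since $\kappa(n)|_{\lambda=0} = \kappa_s(n-1)$ (see (116)), and $h(n)|_{\lambda=0} = 1$, we have from (113) that $\nu_v(\lambda)|_{\lambda=0} = 1$, and also that $Q(0) = 1$. Therefore, from (128) and (130),

$$\left.\frac{\partial\nu_v(\lambda)}{\partial\lambda}\right|_{\lambda=0} = \frac{F_v(0)}{\kappa_s(n-1)} = \frac{a_0}{\kappa_s(n-1)}$$

where $a_0$ is the zero order coefficient of the quadratic, which is negative if an optimal root exists. It follows that $\nu_v(\lambda^*) < 1$ and the corollary is proven. □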


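Proof of Proposition 3: For simplicity, we write $\lambda_n(n)$ as $\lambda$. Using (84) from Lemma 2, we can write

$$\kappa'(\lambda) = \frac{\partial\kappa(n,\lambda)}{\partial\lambda} = \frac{G_s^2(n)\gamma(n)\lambda^2 + 2G_s(n)\gamma(n)\lambda + [\gamma(n) - \|\varepsilon(n,\theta(n-1))\|^2]}{G_s^2(n)\lambda^2 + 2G_s(n)\lambda + 1}$$

and

$$\kappa''(\lambda) = \frac{\partial^2\kappa(n,\lambda)}{\partial\lambda^2} = \frac{2[G_s(n) + \lambda G_s^2(n)]\,\|\varepsilon(n,\theta(n-1))\|^2}{(G_s^2(n)\lambda^2 + 2G_s(n)\lambda + 1)^2}.$$

The denominator of $\kappa'(\lambda)$ is positive on $\lambda \in (0,\infty)$ and therefore $\kappa'(\lambda)$ has a root on $\lambda \in (0,\infty)$ iff its numerator does. The roots of the numerator are always real. Moreover, the numerator has a unique positive root on $(0,\infty)$ iff $[\gamma(n) - \|\varepsilon(n,\theta(n-1))\|^2] < 0$. Further, since $\kappa''(\lambda) > 0$ for all $\lambda > 0$, the root, if it exists, will correspond to a minimum of $\kappa(n,\lambda)$. □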
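As a hedged numerical illustration of this proof: the numerator of $\kappa'(\lambda)$ can be written as $\gamma(n)(1 + \lambda G_s(n))^2 - \|\varepsilon\|^2$, so its positive root has a closed form. The sketch below assumes $G_s(n) > 0$ and $\gamma(n) > 0$; the function and argument names (`kappa_prime`, `kappa_minimizer`, `G`, `gamma`, `eps_sq`) are hypothetical and chosen only for this example.

```python
import math

def kappa_prime(lam, G, gamma, eps_sq):
    """Evaluate kappa'(lambda) as given in the proof (scalar quantities)."""
    num = G * G * gamma * lam * lam + 2.0 * G * gamma * lam + (gamma - eps_sq)
    den = G * G * lam * lam + 2.0 * G * lam + 1.0
    return num / den

def kappa_minimizer(G, gamma, eps_sq):
    """Positive root of the numerator of kappa'(lambda), or None.

    Assumes G > 0 and gamma > 0. The numerator equals
    gamma*(1 + lam*G)**2 - eps_sq, so a positive root exists
    iff gamma - eps_sq < 0, matching the condition in the proposition.
    """
    if gamma - eps_sq >= 0:
        return None
    return (math.sqrt(eps_sq / gamma) - 1.0) / G
```

With $G_s = 2$, $\gamma = 1$ and $\|\varepsilon\|^2 = 9$, the root is $\lambda = (3 - 1)/2 = 1$, and substituting back gives $\kappa'(1) = 0$.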


Proof of Proposition 4: If (67) holds, then $a_0$ of $F_v(\lambda)$, and $b_0$ of $F_t(\lambda)$, are both negative. Now see the proof of Proposition 2. □
Figure 1: A taxonomy of SM methods. [Diagram not reproduced. Box labels: SM Techniques; SM Techniques for State Estimation (State Bounding); SM Techniques for I/O Models (Parameter Bounding); Bounded Error Method; Other Constraints; Non-Linear-in-Parameters Problems; Linear-in-Parameters Problems; Bounding Ellipsoid Algorithms; Polytope Bounding Algorithms; Exact Set Descriptions.]


Figure 3: In the LP model case, error bounding implies pointwise "hyperstrip" regions of possible parameter sets in the parameter space, which, when intersected over a given time range, usually form convex polytopes of feasible parameters. These sets are called $\Omega(n)$ when time range $t \in [1,n]$ is included. Associated with a LSE problem with weights $\lambda_n(\cdot)$ is a hyperellipsoidal set $\bar{\Omega}(n)$ which is centered on the LSE estimate and which contains the feasible set $\Omega(n)$ and, consequently, the true parameters $\theta_*$. Illustrated is the case in which the parameters comprise a real vector of dimension two.


Figure 4: The UOBE algorithm based on QR-WRLS and volume minimization. The case of a scalar output is shown.

INITIALIZATION: Fill the $(m+1) \times (m+1)$ working matrix, $W$, with zeros.

    $\lambda(n) = \zeta(n) = 1$, $n = 1, 2, \ldots, m+1$

    $\tilde{\kappa}(0) = 0$

RECURSION: For $n = 1, 2, \ldots$

STEP 1. (Skip† if $n < m+1$) Update $G_s(n)$, $\varepsilon(n, \theta(n-1))$.

    $T_s(n-1) = \zeta^{-1}(n-1)T(n-1)$ (multiply top $m$ rows of $W$ by $\zeta^{-1}(n-1)$)

    Solve $T_s^T(n-1)g(n) = x(n)$ for $g(n)$ by back substitution.

    $G_s(n) = \|g(n)\|^2$

    $\varepsilon(n, \theta(n-1)) = y(n) - \theta^T(n-1)x(n)$

STEP 2. (Skip if $n < m+1$) Check for and compute optimal $\lambda^*(n)$.

    Consider $a_0$ of (44). If $a_0 > 0$, set $\lambda^*(n) = 0$. Go to STEP 3.

    If $a_0 < 0$, solve (44) for positive root $\lambda^*(n)$.

STEP 3. (Skip if $n < m+1$) If $\lambda^*(n) = 0$, set

    $T(n) = T_s(n-1)$
    $\theta(n) = \theta(n-1)$
    $\kappa(n) = \kappa(n-1)$

    and go to STEP 7. Otherwise, continue.

STEP 4. Update $T(n)$.

    Replace bottom row of $W$ by $\sqrt{\lambda^*(n)}\,[x^T(n) \mid y(n)]$.

    Rotate this "new equation" into $W$ using Givens rotations, leaving the result $[T(n) \mid d_1(n)]$ in the upper $m$ rows of $W$.

    These rotations involve the scalar computations (e.g. [33])

    $W'_{jk} = \sigma W_{jk} + \delta\tau W_{m+1,k}$ and $W'_{m+1,k} = -\tau W_{jk} + \sigma W_{m+1,k}$

    for $k = j, j+1, \ldots, m+1$ and for $j = 1, 2, \ldots, m$;

    where $\sigma = W_{jj}/\rho$, $\tau = W_{m+1,j}/\rho$, $\rho = \sqrt{W_{jj}^2 + \delta W_{m+1,j}^2}$, $\delta$ = unity‡,

    and $W_{jk}$ ($W'_{jk}$) is the $j,k$ element of $W$ pre- (post-) rotation.

STEP 5. (Skip if $n < m$) Update $\theta(n)$, solving $T(n)\theta(n) = d_1(n)$ by back substitution.

STEP 6. Update $\tilde{\kappa}(n)$ and $\kappa(n)$ according to

    $\tilde{\kappa}(n) = [\tilde{\kappa}(n-1)/\zeta^2(n-1)] + \lambda^*(n)\gamma(n)(1 - \gamma^{-1}(n)y^2(n))$
    $\kappa(n) = \|d_1(n)\|^2 + \tilde{\kappa}(n)$

    Compute and store only $\tilde{\kappa}(n)$ if $n < m$.

STEP 7. If new data set $(y(n+1), x(n+1))$ available, return to STEP 1.

†Generally $T(n)$ does not become nonsingular until $n = m - 1$. The first $\theta(n)$ cannot be computed until $n = m + 1$ and the first $\lambda^*(n)$ at $n = m + 2$. We arbitrarily set $\lambda(n) = 1$ on the initial range.

‡$\delta$ is set to $-1$ to rotate an equation out of the estimate.
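The Givens update of STEP 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `rotate_in` and `back_substitute` are hypothetical, the working matrix is a plain list of lists, the caller is assumed to have already placed the weighted row $\sqrt{\lambda^*(n)}\,[x^T(n) \mid y(n)]$ in the bottom row, and the scaling and optimal-weight steps are omitted. The parameter `delta` plays the role of the unity factor in the rotation formulas (a value of $-1$ is described in the text for rotating an equation out, and is not exercised here).

```python
import math

def rotate_in(W, delta=1.0):
    """Rotate the bottom row of the (m+1) x (m+1) matrix W into its top m rows.

    The top m rows hold [T | d1] with T upper triangular; the bottom row holds
    the new (already weighted) equation. On return the first m entries of the
    bottom row are zero. Operates in place and returns W.
    """
    m = len(W) - 1
    for j in range(m):
        rho_sq = W[j][j] ** 2 + delta * W[m][j] ** 2
        if rho_sq <= 0.0:
            continue  # nothing to eliminate in this column
        rho = math.sqrt(rho_sq)
        sigma = W[j][j] / rho
        tau = W[m][j] / rho
        for k in range(j, m + 1):
            top, bottom = W[j][k], W[m][k]
            W[j][k] = sigma * top + delta * tau * bottom
            W[m][k] = -tau * top + sigma * bottom
    return W

def back_substitute(W):
    """Solve T * theta = d1 where [T | d1] occupies the top m rows of W."""
    m = len(W) - 1
    theta = [0.0] * m
    for j in reversed(range(m)):
        s = W[j][m] - sum(W[j][k] * theta[k] for k in range(j + 1, m))
        theta[j] = s / W[j][j]
    return theta
```

Feeding the three consistent equations $\theta_1 = 1$, $\theta_1 + \theta_2 = 3$, $\theta_2 = 2$ one row at a time, each with unit weight, and then back-substituting recovers $\theta = (1, 2)$.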


Figure 5: The F-H OBE algorithm circumscribes the intersection of the current hyperstrip and hyperellipsoid, $\omega(n) \cap \bar{\Omega}(n-1)$, with another hyperellipsoid.

[Plot not reproduced; horizontal axis: Iterations. Caption partially legible:] ... RLS on the (a) fast system and (b) the slow system. The dotted curve represents ... in each case, and the solid curve the parameter estimate.
Figure 8: (a) ^l,;(n) and (b) sgn$\{\kappa(n)\}$ for the "nonadaptive" SM-WRLS simulation of Fig. 7(b).

Figure 9: Parameter estimate results using exponential forgetting adaptation with optimal weights overridden as described in the text. Forgetting factor $\alpha = 0.99$. (a) Fast system: p = 0.073. (b) Slow system: p = 0.094.

Figure 12: Simulation result of the windowed SM-WRLS algorithm ($l = 250$). (a) Fast system: p = 0.094. (b) Slow system: p = 0.10.

Figure U: Simulation result of the windowed SM-WRLS algorithm ($l = 500$) for the fast system: p = 0.070.

Figure 15: Parameter estimate and $\kappa(n)$ vs. $n$ for the fast system using windowed estimation with $l = 1000$. The process has eroded and no longer operates according to proper SM principles.

Figure 16: Simulation result of the selective forgetting SM-WRLS algorithm. (a) Fast system: p = 0.050, b = 0.042. (b) Slow system: p = 0.064, b = 0.047.

Figure Ic: Parameter estimate results for exponential forgetting using suboptimal checking. Weights "overridden" as described in Section f . . Forgetting factor $\alpha = 0.99$. (a) Fast system: p = 0.o':lx.
Table 1:

Approximate computational complexities in average number of cflops per data set for the various techniques discussed in the text. $m$ is the number of parameters in the model; $k$ the dimension of the output vector; $p$ and $p'$ represent the average number of data sets accepted per $n$ in the optimal and suboptimal cases, respectively (typically $p' < p$); and $b$ and $b'$ are the average number of back-rotations performed per $n$ in the optimal and suboptimal cases, respectively (typically $b' < b$). For each sequential algorithm, scaling or adaptation by exponential forgetting requires $0.5m^2 + (k + 0.5)m$ cflops for each procedure. If both procedures are to be used, they can be combined and implemented at about the same cost as a single procedure. In the parallel cases, scaling and exponential forgetting can be achieved at virtually no cost. In the parallel processing cases, the loads in the table represent parallel complexities (see text), and results are for the case $k = 1$ since architectures for the MO case have not been developed.
Abstract

A class of algorithms is presented for training multilayer perceptrons which implement nonlinear mappings using purely "linear" techniques. The methods are based upon linearizations of the network using error surface analysis, followed by a contemporary least squares estimation procedure. Specific algorithms are presented to estimate weights node-wise, layer-wise, and for estimating the entire set of network weights simultaneously. In several experimental studies, the node-wise method is superior to back-propagation and an alternative linearization method due to Azimi-Sadjadi et al. in terms of number of convergences and convergence rate. The layer-wise and network-wise updating offer further improvement.
1 Introduction

This paper introduces a new class of learning algorithms for multilayer perceptrons (MLP) with improved convergence properties. In spite of the nonlinearities present in the dynamics of a MLP, the learning algorithm is purely "linear" in the sense that it is based on a contemporary version of the conventional recursive least squares (RLS) algorithm (e.g. [1]). Accordingly, unlike the popular "nonlinear" algorithms used to train MLPs, the linear algorithm and its potential variants will benefit from the well-understood theoretical properties of RLS and VLSI architectures for its implementation.

An MLP is an artificial neural network consisting of nodes grouped into layers. In this paper, we consider a two-layer network¹, an example of which is illustrated in Fig. 1, but the generalization of the method to an arbitrary number of layers will be obvious. Each node above the input layer in the MLP passes the sum of its weighted inputs through a nonlinearity to produce its output. The inputs to layer zero are external. The outputs of the last layer are the outputs of the network.

Let us now formalize the network and define notation. The number of nodes in layer $i$ is denoted $N_i$, with $N_0$ indicating the number of input nodes at the bottom of the network. The weights connecting to node $k$ ($k'$) of layer two (one) are held in the $N_1$-vector ($N_0$-vector) $w_k$ ($w'_{k'}$).

The inputs to the nodes in all layers except the first are the outputs of the layer below. We denote by $N$ the number of training patterns

$$\{(x(n), t(n)),\ n = 1, 2, \ldots, N\}, \tag{1}$$

in which each $x(n)$ is an $N_0$-vector of inputs to the bottom layer of the network, and each $t(n)$ is an $N_2$-vector of target outputs (final layer outputs which are desired in response to the corresponding input). The computed outputs of layer two (one) in response to $x(n)$ comprise the $N_2$-vector ($N_1$-vector) $y(n)$ ($y'(n)$). Throughout the discussion, $v_j$ will be used to denote the $j$th element of vector $v$.

¹Some authors might choose to call this a three layer network. We shall designate the bottom layer of "nodes" as "layer zero" and not count it in the total number of layers. Layer zero is a set of linear nodes which simply pass the inputs unaltered. For this reason, we choose not to show circular nodes in the diagram.

Finally, we need to formalize the nonlinearity associated with the nodes. For given weights, $w_k$, connected to node $k$ of the final layer, for example, the output in response to input $x(n)$ is

$$y_k(n) = S(w_k^T y'(n)) \tag{2}$$

in which $S(\cdot)$ is a nonlinear mapping. Typically, for example, a sigmoidal function would be used:

$$S(\alpha) = \frac{1}{1 + e^{-\alpha}}. \tag{3}$$

Any function which is once differentiable can be employed in the methods to be presented. For convenience we also define

$$u_k(n) \stackrel{\rm def}{=} w_k^T y'(n). \tag{4}$$

Clearly, $u_k(n)$ is the input to node $k$ in the output layer in response to pattern $n$. $u'_l(n)$ is similarly defined as the input to node $l$ in layer one.
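The notation above can be made concrete with a short forward-pass sketch for the two-layer network. The function names (`sigmoid`, `forward`) and the list-of-lists weight layout are assumptions for illustration; as in the text, no bias terms are included.

```python
import math

def sigmoid(a):
    """Logistic nonlinearity S(a) of equation (3)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, W2):
    """Forward pass through a two-layer MLP.

    x  : list of N0 external inputs (layer zero passes them unaltered)
    W1 : list of N1 weight vectors, each of length N0 (layer one)
    W2 : list of N2 weight vectors, each of length N1 (layer two)
    Returns (u1, y1, u2, y2): node inputs and outputs of layers one and two.
    """
    u1 = [sum(w * xi for w, xi in zip(wl, x)) for wl in W1]
    y1 = [sigmoid(u) for u in u1]
    u2 = [sum(w * yi for w, yi in zip(wk, y1)) for wk in W2]  # equation (4)
    y2 = [sigmoid(u) for u in u2]                             # equation (2)
    return u1, y1, u2, y2
```

With all weights set to zero, every node input is zero and every node output is $S(0) = 0.5$.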

Many training (weight estimation) algorithms exist for this type of network [2] - [6]. The most popular is the so-called back-propagation algorithm [5], [6]. Back-propagation performs satisfactorily in some cases if given enough time to converge. However, convergence can be too slow for many applications (e.g. [7]).

One attempt to develop faster training methods is represented by the class of algorithms in which the network mapping is "linearized" in some sense in order to take advantage of linear estimation algorithms. In particular, $S(\cdot)$ can be replaced by a linear approximation. A recent example employing this strategy is the method reported by Azimi-Sadjadi et al. [2]. We shall refer to this technique as the A-S algorithm. While the initial method developed in this paper will be shown to be equivalent in certain theoretical senses to the A-S algorithm, its derivation is quite different (providing a second interpretation of the underlying linearization) and the implementation approach will result in significantly improved performance.

2  Linearization  Algorithm 

The training problem for the two layer MLP is stated as follows: Given a set of $N$ training patterns as in (1), find the network weights which minimize the sum of squared errors,

$$E = \sum_{n=1}^{N} (t(n) - y(n))^T (t(n) - y(n)). \tag{5}$$

The purpose of this section is to describe the theoretical basis for a "linearized" solution of this problem.

Before continuing, we note a simple fact which will reduce the number of details in our discussion. It is easy to show from (5) that if only the weights connected to the output layer are allowed to change, with the other weights in the network held constant, then $E$ is minimized by minimizing the errors associated with each node independently. That is, if $E_k$ denotes the error associated with output node $k$,

$$E_k = \sum_{n=1}^{N} (t_k(n) - y_k(n))^2 \tag{6}$$

then $E = \sum_{k=1}^{N_2} E_k$. In the output layer, each node involves a distinct set of weights (those directly connected to it), and each set of weights may be optimized (to reduce its node's error) independently of the others. This means that without loss of generality, we may focus on a single output node. (Whatever optimization method is discovered for this node will then be applied to other nodes in the layer.) Let us concentrate on, say, node $k$ and seek weights which minimize (6).

First we wish to concentrate on the training of the weights in the final layer, so let us write $E_k$ in a form which explicitly features these weights,

$$E_k = \sum_{n=1}^{N} [t_k(n) - S(w_k^T y'(n))]^2. \tag{7}$$

Algorithms for finding the optimal solution, say $w_k^*$, to this problem are well-known if the modeled output depends only upon a linear combination of pattern-invariant (constant) weights. In the linear case $y_k(n) = S(w_k^T y'(n)) = \beta w_k^T y'(n)$, for some constant $\beta$ (which can be taken as unity without loss of generality), and the error expression takes the form

$$E_k = \sum_{n=1}^{N} [t_k(n) - w_k^T y'(n)]^2. \tag{8}$$

The solution in the linear case is the solution to the classical linear least squares "normal equations" [8]. The solution of the normal equations can proceed in a variety of ways. It is also possible to arrive at the solution without explicitly forming the normal equations. This is the case, for example, when using the least mean square (LMS) algorithm (e.g. see [9] or [10]), a recursive solution which amounts to "back-propagation" for a linear network. A second popular method is the conventional recursive least square (RLS) algorithm (e.g. see [11]). A contemporary version of the latter will serve as a computational basis for the algorithm to be described in this paper, and RLS is also the basis for the A-S algorithm to which we wish to relate the method of this paper. Appropriate description and formalism will be introduced as needed.

It is well-known that least squares estimation problems may be discussed in terms of their error surfaces, in this case the graph of $E_k$ as a function of $w_k$. Whatever the form of the least square estimation algorithm, the ideal goal is to find the weight vector, say $w_k^*$, corresponding to the global minimum of $E_k(w_k)$. It is important to future developments to note that $E_k$ depends not only on $w_k$ but also upon the training patterns $\{(x(n), t_k(n)), n \in [1,N]\}$ (see (7)). (In fact, since we have "frozen" the weights in the first layer, it is more appropriate in this case to view $E_k$ as a function of $w_k$ and the pairs $\{(y'(n), t_k(n)), n \in [1,N]\}$.) Once the training patterns are fixed, the error function may be described as a surface over the $N_1$-dimensional hyperplane corresponding to the weights. Theoretically, the pairs $\{(y'(n), t_k(n)), n \in [1,N]\}$ represent partial realizations of a two dimensional stochastic process which generates them. In this sense

$$E_k(w_k, \{(y'(n), t_k(n)), n \in [1,N]\}) \tag{9}$$

is only a sample error surface. In a pure sense, we would like to find weights corresponding to the global minimum of $\mathcal{E}\{E_k(w_k)\}$ where $\mathcal{E}$ denotes the expected value. We must be content, however, to work with the sample surface provided by the training data.

The point of this discussion of error surfaces is to note that different algorithms construct and use different sample error surfaces from the data. With LMS (or back-propagation), error surfaces are sequentially constructed from individual training patterns, i.e., error surfaces of the form

$$E_k(w_k, (y'(n), t_k(n))), \quad n = 1, 2, \ldots, N \tag{10}$$

are created, and for each $n$, the weights are moved in the direction of the negative gradient on that surface. The convergence properties are well-understood. RLS², on the other hand, creates sequentially more refined error surfaces of the form

$$E_k(w_k, \{(y'(j), t_k(j)), j \in [1,n]\}) \tag{11}$$

as $n$ is incremented. At each step, if a weight update is computed, the solution corresponds to the unique minimum of the newly refined surface. We can appreciate, therefore, that even if we neglect nonlinearities, the estimation processes behave quite differently with respect to their error surface analysis.

²Of course, here we are speaking of a linear model identification.

The linearization technique adopted in this work can be explained in terms of the error surface analysis. The error surface over which we would like to find the (global) minimum by choice of weights is given by (7). Suppose we wish to construct a "linearized" error surface, say $\tilde{E}_k$, which is "similar" in some sense to $E_k$ in a neighborhood of the present weights. Recalling that $E_k$ is a function not only of the weights, but also of the training patterns, the fundamental question is: Can the pairs $\{(y'(n), t_k(n)), n \in [1,N]\}$ be modified in some sense, say $(y'(n), t_k(n)) \rightarrow (\tilde{y}'_k(n), \tilde{t}_k(n))$, so that

$$\tilde{E}_k(w_k, \{(\tilde{y}'_k(n), \tilde{t}_k(n)), n \in [1,N]\}) = \sum_{n=1}^{N} [\tilde{t}_k(n) - w_k^T \tilde{y}'_k(n)]^2 \tag{12}$$

$$\approx E_k(w_k, \{(y'(n), t_k(n)), n \in [1,N]\}) = \sum_{n=1}^{N} [t_k(n) - S(w_k^T y'(n))]^2$$

in some neighborhood of the present weights? The answer to this question is the key theoretical development described in the following paragraphs.

In the ensuing discussion, the notation $w_k^*$ will be used to designate a local minimum of $E_k$. Ideally, $w_k^*$ will be the global minimum, but we have no way to assure this. The objective is to find, by a "linear" algorithm, a close approximation to $w_k^*$.
The algorithm to be described proceeds in iterations, indexed by $i = 1, 2, \ldots$. Each iteration represents one complete training cycle through the $N$ training patterns. Suppose that a weight vector estimate $w_k(i-1)$ results from iteration $i-1$. In iteration $i$, by manipulation of the data, we work with a "linearized" error surface which is similar to the nonlinear surface in the neighborhood of $w_k(i-1)$. The similarity follows from two criteria:

1. $\tilde{E}_k(w_k(i-1), \{(\tilde{y}'_k(n), \tilde{t}_k(n)), n \in [1,N]\}) = E_k(w_k(i-1), \{(y'(n), t_k(n)), n \in [1,N]\})$;

2. $\left.\dfrac{\partial \tilde{E}_k}{\partial w_k}\right|_{w_k = w_k(i-1)} = \left.\dfrac{\partial E_k}{\partial w_k}\right|_{w_k = w_k(i-1)}$.

The first task is to manipulate the pairs $\{(\tilde{y}'_k(n), \tilde{t}_k(n)), n \in [1,N]\}$ so that these criteria hold. This is accomplished as follows. It follows from Criterion 1 that

$$\sum_{n=1}^{N} (t_k(n) - y_k(n))^2 = \sum_{n=1}^{N} (\tilde{t}_k(n) - w_k^T(i-1)\tilde{y}'_k(n))^2. \tag{14}$$

By letting

$$t_k(n) - y_k(n) = \tilde{t}_k(n) - w_k^T(i-1)\tilde{y}'_k(n),$$

or

$$\tilde{t}_k(n) = (t_k(n) - y_k(n)) + w_k^T(i-1)\tilde{y}'_k(n), \tag{15}$$

for each $n$, Criterion 1 is met. Now we take the partial derivatives required in Criterion 2. For the "nonlinear" error,

$$\left.\frac{\partial E_k}{\partial w_k}\right|_{w_k = w_k(i-1)} = -2\sum_{n=1}^{N} (t_k(n) - y_k(n))\,\dot{S}(u_k(n))\,y'(n) = -2\sum_{n=1}^{N} (t_k(n) - y_k(n))\,\dot{S}(w_k^T(i-1)y'(n))\,y'(n) \tag{16}$$

where

$$\dot{S}(u_k(n)) \stackrel{\rm def}{=} \left.\frac{dS(\alpha)}{d\alpha}\right|_{\alpha = u_k(n)}. \tag{17}$$

All inputs and outputs in this and similar expressions are those associated with weights $w_k(i-1)$ (or the "current" set of weights around which linearization is taking place), but we will avoid writing $u_k(i-1,n)$, for example, for simplicity. For the "linear" error,

$$\left.\frac{\partial \tilde{E}_k}{\partial w_k}\right|_{w_k = w_k(i-1)} = -2\sum_{n=1}^{N} (\tilde{t}_k(n) - w_k^T(i-1)\tilde{y}'_k(n))\,\tilde{y}'_k(n). \tag{18}$$

Equating (16) and (18), in light of (14) we have

$$\tilde{y}'_k(n) = \dot{S}(w_k^T(i-1)y'(n))\,y'(n). \tag{19}$$

All quantities needed to compute the modified pair $(\tilde{t}_k(n), \tilde{y}'_k(n))$ are known or can be calculated at pattern $n$. This procedure is repeated for each $k$ (output node).
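The two criteria can be checked numerically with a small sketch. It assumes the logistic nonlinearity of (3), for which the derivative is $S(\alpha)(1 - S(\alpha))$; the function names (`linearize`, `nonlinear_error_grad`, `linear_error_grad`) and the data layout are hypothetical and serve only to illustrate (15), (16), (18), and (19) for a single output node.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def linearize(w_prev, patterns):
    """Build modified pairs (y_mod, t_mod) around the weights w_prev.

    patterns: list of (y1, t) with y1 the layer-one outputs and t the scalar
    target of the output node under consideration.
    """
    modified = []
    for y1, t in patterns:
        y = sigmoid(dot(w_prev, y1))
        s_dot = y * (1.0 - y)                  # logistic derivative at u_k(n)
        y_mod = [s_dot * v for v in y1]        # equation (19)
        t_mod = (t - y) + dot(w_prev, y_mod)   # equation (15)
        modified.append((y_mod, t_mod))
    return modified

def nonlinear_error_grad(w, patterns):
    """Error (7) and its gradient (16) evaluated at w."""
    err, grad = 0.0, [0.0] * len(w)
    for y1, t in patterns:
        y = sigmoid(dot(w, y1))
        r = t - y
        err += r * r
        for j, v in enumerate(y1):
            grad[j] += -2.0 * r * y * (1.0 - y) * v
    return err, grad

def linear_error_grad(w, modified):
    """Linearized error and its gradient (18) evaluated at w."""
    err, grad = 0.0, [0.0] * len(w)
    for y_mod, t_mod in modified:
        r = t_mod - dot(w, y_mod)
        err += r * r
        for j, v in enumerate(y_mod):
            grad[j] += -2.0 * r * v
    return err, grad
```

Evaluated at the linearization point, the two error values and the two gradients agree to within rounding, which is exactly what Criteria 1 and 2 require. Away from that point the surfaces generally differ.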

Before extending the analysis down to layer one, let us ponder the significance of what we have done. By modifying the data pairs, we have created a "linear" error surface which is similar to the "nonlinear" one in the neighborhood of $w_k(i-1)$. In particular, the error surfaces match at that point, and their gradients are identical with respect to the weight vectors. We can find the $w_k$ which minimizes $\tilde{E}_k$ by simple linear least squares processing of the modified data $\{(\tilde{t}_k(n), \tilde{y}'_k(n)), n \in [1,N]\}$. The linear estimate will correspond to a minimum of the error surface $\tilde{E}_k$ which need not be near a minimum of $E_k$. However, because the error surfaces and the gradients match with respect to the weight vector of node $k$, if the weight change is small enough, the weight change will be in the direction of decreasing $E_k$. Accordingly, the linear weights must be constrained to remain in a reasonably small neighborhood of $w_k(i-1)$. Because $E_k$ is reduced at each iteration, it is to be expected that a minimum of $E$ will be reached by repeating this procedure. In turn, this implies convergence to the "nonlinear" solution for the weights, using purely linear techniques.

Let us now move down to the lower layer and consider the estimation of the weights $\{w'_l, l \in [1,N_1]\}$. By similar reasoning to the above, we may focus on a single node, say node $l$. However, we must now optimize with respect to the entire external error, $E$, since all nodes in the upper layer are affected by these weights. Suppose that we are working on the $i$th cycle through the training patterns and that all weights in the upper layer are fixed at their newly updated values $\{w_k(i), k \in [1,N_2]\}$. Taking the derivative of $E$ with respect to $w'_l$,

$$\frac{\partial E}{\partial w'_l} = -2\sum_{n=1}^{N}\sum_{j=1}^{N_2} (t_j(n) - y_j(n))\,\dot{S}(u_j(n))\,w_{j,l}(i)\,\dot{S}([w'_l]^T x(n))\,x(n) \tag{20}$$

where $w_{j,l}(i)$ denotes the $l$th element in weight vector $w_j(i)$ (weight on connection from node $l$ in layer one to node $j$ in layer two). This expression can be written

$$\frac{\partial E}{\partial w'_l} = -2\sum_{n=1}^{N} (t'_l(n) - y'_l(n))\,\dot{S}([w'_l]^T x(n))\,x(n) \tag{21}$$

where $t'_l(n)$ is called the target value for inner node $l$ and is defined such that

$$(t'_l(n) - y'_l(n)) = \sum_{j=1}^{N_2} (t_j(n) - y_j(n))\,\dot{S}(u_j(n))\,w_{j,l}(i). \tag{22}$$

The quantity on the right side of (22) is commonly called the back-propagated error for node $l$. The solution sought, say $w'^*_l$, is one for which

$$\left.\frac{\partial E}{\partial w'_l}\right|_{w'_l = w'^*_l} = 0. \tag{23}$$
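Equation (22) is a weighted sum over the output nodes and can be sketched in a few lines. The function names (`back_propagated_error`, `inner_target`) and the argument layout are assumptions for illustration; `s_dot_u[j]` stands for the derivative of the nonlinearity evaluated at $u_j(n)$.

```python
def back_propagated_error(l, t, y, s_dot_u, W2):
    """Right side of (22) for inner node l at one training pattern.

    t, y     : targets and computed outputs of the N2 output nodes
    s_dot_u  : derivative of the nonlinearity at each output node input u_j(n)
    W2       : W2[j][l] is the weight from layer-one node l to output node j
    """
    return sum((t[j] - y[j]) * s_dot_u[j] * W2[j][l] for j in range(len(t)))

def inner_target(l, y1, t, y, s_dot_u, W2):
    """Target t'_l(n) implied by (22): computed output plus back-propagated error."""
    return y1[l] + back_propagated_error(l, t, y, s_dot_u, W2)
```

For two output nodes with errors $0.5$ and $-0.5$, derivative values $0.25$, and weights $2$ and $4$ from the inner node, the back-propagated error is $0.5 \cdot 0.25 \cdot 2 - 0.5 \cdot 0.25 \cdot 4 = -0.25$.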
In the top layer, for node $k$ we sought $w_k^*$ such that

$$\left.\frac{\partial E_k}{\partial w_k}\right|_{w_k = w_k^*} = 0. \tag{24}$$

With reference to (16), it is clear that the present optimization problem is equivalent to the ones encountered at the upper nodes. In particular, the same linearization considerations can be applied to obtain modified input and target values, say

$$(t'_l(n), x(n)) \rightarrow (\tilde{t}'_l(n), \tilde{x}_l(n)) \tag{25}$$

and the set of layer one weights $w'_l(i)$ computed accordingly for each $l$.

Before continuing, let us note the relationship to the A-S algorithm noted above. In fact, to this point in the discussion, the methods are nearly equivalent though derived from different starting points. In the $i$th iteration through the training patterns, prior to updating the weights at pattern $n$, node $k$ is "linearized" in the A-S algorithm by approximating $S(\alpha)$ by a linear function which is tangential to³ $S$ at $w_k^T(i-1)y'(n)$. In effect, $S(\alpha)$ is approximated by the first two terms of a Taylor series,

$$S(\alpha) \approx \hat{S}(\alpha) = \dot{S}(w_k^T(i-1)y'(n))(\alpha - w_k^T(i-1)y'(n)) + S(w_k^T(i-1)y'(n)) \tag{26}$$

$$= \dot{S}(w_k^T(i-1)y'(n))\,\alpha + [S(w_k^T(i-1)y'(n)) - \dot{S}(w_k^T(i-1)y'(n))\,w_k^T(i-1)y'(n)]$$

$$\stackrel{\rm def}{=} K_k(n)\,\alpha + b_k(n).$$

Azimi-Sadjadi et al. [2] recognized that by using this approximation in (16), the optimization problem became equivalent to a set of linear least square error normal equations if the data were modified according to (15) and (19). Therefore, by quite different means, the theoretical developments arrive at the same set of linear equations to be solved.

³In fact, if $w_k(i, n-1)$ denotes the weight estimates after pattern $n-1$ in iteration $i$, then in the A-S algorithm, $S(\cdot)$ is linearized around these weights rather than the weights at the end of the previous cycle. Of course, this process could also be used in our algorithm, but we find that it makes no significant difference, and the computational expense of updating the weights at each $n$ is avoided in our case.

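In principle, once the linearization is achieved at iteration $i$ and pattern $n$, any least mean square type algorithm can be employed to update the weight estimates. The A-S method uses the conventional RLS algorithm. In this case, neglecting any error weighting, RLS takes the form of the two recursions (written for node $k$ in the top layer) [11, Ch. 5],

$$P(n) = P(n-1) - \frac{P(n-1)\,\tilde{y}'_k(n)\,[\tilde{y}'_k(n)]^T P(n-1)}{1 + [\tilde{y}'_k(n)]^T P(n-1)\,\tilde{y}'_k(n)} \tag{27}$$

$$w_k(i,n) = w_k(i,n-1) + P(n)\,\tilde{y}'_k(n)\,[\tilde{t}_k(n) - w_k^T(i,n-1)\,\tilde{y}'_k(n)]. \tag{28}$$

$w_k(i,n)$ is the estimate of the weights $w_k$ following pattern $n$ in the $i$th iteration through the training data, and $P(n)$ is the covariance matrix at the same "time" in the process,

$$P(n) \stackrel{\rm def}{=} \left[\sum_{j=1}^{n} \tilde{y}'_k(j)\,[\tilde{y}'_k(j)]^T\right]^{-1}. \tag{29}$$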

def 

Note that $w_k(i,0) \stackrel{\mathrm{def}}{=} w_k(i-1,N)$ and similarly for the covariance matrix. This presents the
question of how $w_k(0,0)$ and $P(0,0)$ should be initialized. The inverse covariance matrix contains
theoretically infinite values at the outset and a proper initialization for the weights is practically
not known (this means that the initial linearizations of the training data are based on potentially
very bad weight estimates). This issue will be addressed further below. Also, it is clear that
this solution, as written, will continue to “accumulate” past linearized sets of data which might,
in fact, be linearized around very poor weight estimates. Therefore, the A-S algorithm includes
a “forgetting factor” [11] in the RLS recursions. This is equivalent to using a weighted error
criterion with time varying (exponentially decaying) weights. This can make convergence slow if
the forgetting factor is large. If the forgetting factor is small, then past values are forgotten more
quickly, but this leads to convergence problems. We will also comment further on this issue below.
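As a minimal sketch of the exponentially weighted RLS recursion discussed above (not the A-S implementation itself), the following NumPy code updates a weight vector one pattern at a time. The function name `rls_fit`, the forgetting factor name `lam`, and the initial covariance scale `p0` are illustrative assumptions; the large `p0` stands in for the theoretically infinite initial values mentioned in the text.

```python
import numpy as np

def rls_fit(X, t, lam=0.98, p0=1e4):
    """Exponentially weighted RLS over the rows of X; returns the final weights."""
    n_features = X.shape[1]
    w = np.zeros(n_features)        # arbitrary initial weight estimate
    P = p0 * np.eye(n_features)     # large P(0) approximates an "infinite" start
    for x, target in zip(X, t):
        Px = P @ x
        gain = Px / (lam + x @ Px)            # gain vector for this pattern
        w = w + gain * (target - x @ w)       # correct by the a priori error
        P = (P - np.outer(gain, Px)) / lam    # discount older information
    return w

# Noise-free linear data, used only to exercise the recursion.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true
w_hat = rls_fit(X, t, lam=0.98)
```

With noise-free linear data the estimate approaches the generating weights; on linearized MLP data the same recursion would inherit whatever errors the early linearizations and the initial values contain.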

We  have  found  that  the  choice  of  conventional  RLS  as  a  solution  method  seriously  impairs 
the  ability  of  this  linearization  method  to  converge  on  a  proper  set  of  network  weights.  As  an 
alternative,  therefore,  we  suggest  the  method  presented  in  the  following  section. 

3  Solution  by  QR  Decomposition 

In order to improve convergence, the algorithm developed above can be implemented using QR
decomposition [1, 8]. This algorithm has distinct advantages over conventional RLS. First, the
QR algorithm does not suffer from the initialization problems noted above for RLS. It also permits the
inclusion of several very flexible “forgetting” strategies. To illustrate the operation of the algorithm,
it is sufficient to consider the estimation of weights $w_k$ in the output layer of the network. All
notation is consistent with that used above except the number of nodes in layer one is denoted $M$.

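In effect, the linearization technique described above reduces the problem at the $i$th iteration
through the training patterns to one of finding the least square error solution of the overdetermined
system of equations

$$\begin{bmatrix} [y'(1)]^T \\ [y'(2)]^T \\ \vdots \\ [y'(N)]^T \end{bmatrix} w_k(i) = \begin{bmatrix} t_k(1) \\ t_k(2) \\ \vdots \\ t_k(N) \end{bmatrix}. \qquad (30)$$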

The QR decomposition method is based upon transforming this system into an upper triangular
system by applying a series of orthonormal operators (Givens rotations). The resulting system is

$$\begin{bmatrix} T(N) \\ 0_{(N-M) \times M} \end{bmatrix} w_k(i) = \begin{bmatrix} d_1(N) \\ d_2(N) \end{bmatrix}, \qquad (31)$$

where the matrix $T(N)$ is $M \times M$ upper triangular and $0_{i \times j}$ denotes the $i \times j$ zero matrix. The
solution for $w_k(i)$ is easily obtained by back-substitution. A recursive version of the solution is also
possible. The recursive algorithm is shown in Fig. 2. For details the reader is referred to [1].
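A minimal sketch of the idea, assuming real-valued data and plain Givens rotations: each equation is rotated into an (M+1) x (M+1) working array and the weights are then recovered by back-substitution. The helper names `rotate_in` and `back_substitute` are hypothetical, and the random data only stand in for linearized training patterns.

```python
import numpy as np

def rotate_in(A, row):
    """Rotate one equation [inputs | target] into the working matrix A (in place)."""
    M = A.shape[0] - 1
    A[M, :] = row                                 # new equation enters the bottom row
    for m in range(M):
        rho = np.hypot(A[m, m], A[M, m])
        if rho == 0.0:
            continue                              # nothing to eliminate in this column
        c, s = A[m, m] / rho, A[M, m] / rho
        top, bottom = A[m, m:].copy(), A[M, m:].copy()
        A[m, m:] = c * top + s * bottom           # updated row m of [T | d1]
        A[M, m:] = -s * top + c * bottom          # bottom-row entry in column m becomes 0
    return A

def back_substitute(A):
    """Solve T w = d1 for the upper triangular block of A."""
    M = A.shape[0] - 1
    T, d1 = A[:M, :M], A[:M, M]
    w = np.zeros(M)
    for m in range(M - 1, -1, -1):
        w[m] = (d1[m] - T[m, m + 1:] @ w[m + 1:]) / T[m, m]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                      # 20 equations, M = 3 unknowns
t = rng.normal(size=20)
A = np.zeros((4, 4))                              # null working matrix
for x_n, t_n in zip(X, t):
    rotate_in(A, np.append(x_n, t_n))
w_qr = back_substitute(A)
w_ref = np.linalg.lstsq(X, t, rcond=None)[0]      # batch solution for comparison
```

Because the rotations are orthonormal, the recursive result should agree with the batch least squares solution to within rounding error, and no initial weight or covariance values are needed.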

For discussion of further benefits of the decomposition algorithm, it is useful to view the $A$
matrix, defined in Fig. 2, as four partitions. Following the rotation of the $n$th equation in Step 2,
for example,

$$A = \begin{bmatrix} T(n) & d_1(n) \\ 0_{1 \times M} & d_2(n) \end{bmatrix}. \qquad (32)$$

As is the case with the A-S method, a forgetting factor must be employed to gradually reduce
the effects of earlier linearizations. This is very easily accomplished in the QR algorithm by simply
multiplying the top $M$ rows of the matrix $A$ (matrix $T(n)$ and vector $d_1(n)$) by a factor $\beta < 1$
prior to the rotation of the $(n+1)$st pattern equation. In this context, both the forgetting factor
and the frequency of weight updates can be varied. In addition to exponential forgetting factors,
equations can be “rotated out” of the matrix. This is done by changing $\delta$ in Fig. 2 to $-1$ and
rotating in the equation to eliminate. Thus, for example, only the last $Q > M$ equations can be
used to calculate the weight updates by sequentially removing equation $n - Q + 1$ prior to inclusion
of equation $n$. This procedure effects a sliding window over which the estimates are computed.
Another forgetting method useful for MLPs is possible because no initialization of the updating
equations is necessary. Because there are no initialization problems, the system can be re-initialized
at any step, thus completely “forgetting” the past linearized values. These and a number of other
flexible forgetting strategies made possible by this algorithm may prove very useful in the training
of MLPs.
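The exponential forgetting step can be sketched as follows, assuming real-valued data. Scaling the top M rows by a factor (called `beta` here, a name chosen for illustration) before each rotation should give the same solution as a batch least squares fit in which equation n is scaled by beta raised to the power N - n; the code checks this on random data.

```python
import numpy as np

def rotate_in(A, row, beta=1.0):
    """Scale the stored system by beta, then rotate in a new equation (in place)."""
    M = A.shape[0] - 1
    A[:M, :] *= beta                              # forget: scale T(n) and d1(n)
    A[M, :] = row
    for m in range(M):
        rho = np.hypot(A[m, m], A[M, m])
        if rho == 0.0:
            continue
        c, s = A[m, m] / rho, A[M, m] / rho
        top, bottom = A[m, m:].copy(), A[M, m:].copy()
        A[m, m:] = c * top + s * bottom
        A[M, m:] = -s * top + c * bottom
    return A

rng = np.random.default_rng(2)
N, M, beta = 30, 3, 0.98
X = rng.normal(size=(N, M))
t = rng.normal(size=N)

A = np.zeros((M + 1, M + 1))
for x_n, t_n in zip(X, t):
    rotate_in(A, np.append(x_n, t_n), beta)
w_forget = np.linalg.solve(A[:M, :M], A[:M, M])   # triangular system T w = d1

# Batch reference: the oldest equation is scaled by beta**(N-1), the newest by 1.
scale = beta ** np.arange(N - 1, -1, -1)
w_ref = np.linalg.lstsq(X * scale[:, None], t * scale, rcond=None)[0]
```

Re-initializing is then simply a matter of zeroing the working matrix, for example `A[:] = 0.0`, which discards all past linearized equations at once.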

In addition to new forgetting factors, using the QR implementation also allows the frequency
of updating of the weights to vary. As with conventional RLS, the weights can be updated every
time a new linearization has been used.⁴

The theoretical results above, along with those in Section 2, can be combined to form a learning
algorithm for MLPs. First, the weights of the network must be initialized. This is done randomly,
each weight being selected from a uniform distribution over the set $[-1, 1]$. Once the initial weights
are chosen, the weight updating can begin. First, a training pattern is input to the system. Because
the weights are not updated until all the training patterns have been used, convergence does
not depend on the order in which the training patterns are used. Given a training pattern, the
algorithm calculates linearized training patterns for the last layer nodes and these are rotated into
the corresponding $A$ matrix. Each node has a “separate” $A$ matrix. The target outputs of the
layer below are calculated next using back-propagation. The $A$ matrices for the first layer are
then updated. A new training pattern is then used to calculate a new set of linearized inputs and
outputs. This is repeated until all the training patterns have been used. The $A$ matrices are then
used to calculate updated weights. This continues until the network converges to a solution. By
definition the solution is said to have converged when the change of the norm of the vector of all the
weights is below a threshold. As with other training algorithms for MLPs, this algorithm may not
converge to the weights corresponding to the global minimum of the function $E$. Also, although
the algorithm approximates a gradient system, because it is not a gradient system, there is no
guarantee that the algorithm will converge to any solution. Thus for implementation a maximum
is placed on the number of iterations.

⁴This can be as often as every pattern (see Footnote 3), or at the end of each iteration through the patterns, as
has been our convention.

4  Complete  Layer  and  Network  Updating 

The back-propagation algorithm updates each weight at each node individually. All the weights in
the network except one are fixed and this is changed to reduce $E$. The algorithm described in the
previous section updates all the weights connected to one node simultaneously. All the weights in
the network except those connected to one node are fixed, and those weights are updated to reduce
$E$. These two methods of updating weights may not be optimal because $E$ is a function of all the
network weights and may not be minimized by updating the weights of each node independently.
Minimizing an error implies there is a target value. There are no given target values for the inner
layers so these are computed assuming the weights of the layers above are fixed. These target values
allow the weights of each node to be updated independently. This makes the computations easier,
but does not take into consideration the interdependence of the nodes.

The  next  step  in  the  development  is  to  demonstrate  how  to  update  all  the  weights  connected 
to  one  layer  simultaneously. 

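The following derivation uses the linearization of the nonlinearity suggested by (26). Note that

$$y_k(n) = S(w_k^T y'(n)) = S\Big(\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\Big) \qquad (33)$$

and

$$y'_j(n) = S([w'_j]^T x(n)) = S\Big(\sum_{l=1}^{N_0} w'_{j,l}\, x_l(n)\Big). \qquad (34)$$

The linearization replaces $S(u)$ by $Ku + b$. Thus

$$y_k(n) = K_k(n)\Big(\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\Big) + b_k(n) \qquad (35)$$

and

$$y'_j(n) = K'_j(n)\Big(\sum_{l=1}^{N_0} w'_{j,l}\, x_l(n)\Big) + b'_j(n), \qquad (36)$$

so

$$y_k(n) = K_k(n)\Big(\sum_{j=1}^{N_1} w_{k,j}\Big[K'_j(n)\Big(\sum_{l=1}^{N_0} w'_{j,l}\, x_l(n)\Big) + b'_j(n)\Big]\Big) + b_k(n). \qquad (37)$$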
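To make the replacement of S(u) by Ku + b concrete, here is a small sketch that assumes a logistic sigmoid and a first-order (tangent) expansion around an operating point `u0`. Both are assumptions: the sigmoid of (3) and the exact linearization of (26) are defined earlier in the paper and may differ in detail.

```python
import numpy as np

def sigmoid(u):
    """Logistic sigmoid, assumed here as a stand-in for S(u)."""
    return 1.0 / (1.0 + np.exp(-u))

def linearize(u0):
    """Return slope K and offset b so that K*u + b is tangent to the sigmoid at u0."""
    s = sigmoid(u0)
    K = s * (1.0 - s)        # derivative of the logistic sigmoid at u0
    b = s - K * u0           # offset chosen so the line passes through (u0, S(u0))
    return K, b

u0 = 0.4
K, b = linearize(u0)
approx_at_u0 = K * u0 + b              # matches the sigmoid at the operating point
approx_near = K * (u0 + 0.05) + b      # close to the sigmoid for nearby arguments
```

The approximation is exact at the operating point and degrades as the argument moves away from it, which is why linearizations around poor weight estimates can be misleading.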


To  update  the  weights  in  layer  two,  the  weights  in  layer  one  of  the  network  are  fixed.  With 
these  weights  fixed,  the  weights  connected  to  different  nodes  in  the  output  layer  can  be  updated 
independently  and  the  same  update  equations  as  in  the  previous  section  result.  To  update  the 
weights  in  the  first  layer,  the  weights  in  the  last  layer  are  fixed.  Thus 

$$y_k(n) = K_k(n) \sum_{j=1}^{N_1} w_{k,j} K'_j(n) \sum_{l=1}^{N_0} w'_{j,l}\, x_l(n) + K_k(n) \sum_{j=1}^{N_1} w_{k,j}\, b'_j(n) + b_k(n), \qquad (38)$$

or

$$y_k(n) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} \big(K_k(n) w_{k,j} K'_j(n) x_l(n)\big)\, w'_{j,l} + \Big[\sum_{j=1}^{N_1} K_k(n) w_{k,j} b'_j(n) + b_k(n)\Big], \qquad (39)$$

so

$$y_k(n) - \Big[\sum_{j=1}^{N_1} K_k(n) w_{k,j} b'_j(n) + b_k(n)\Big] = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} \big(K_k(n) w_{k,j} K'_j(n) x_l(n)\big)\, w'_{j,l}. \qquad (40)$$

Hence for one output, this is the same as a linear system with $N_0 \times N_1$ inputs and one output. The
linearized output and inputs are

$$\tilde{t}_k(n) = t_k(n) - \Big[\sum_{j=1}^{N_1} K_k(n) w_{k,j} b'_j(n) + b_k(n)\Big] \qquad (41)$$

and

$$\tilde{x}_{j,l}(n) = K_k(n) w_{k,j} K'_j(n) x_l(n). \qquad (42)$$

For one output, $y_k(n)$, and $N$ training patterns,

$$E_k = \sum_{n=1}^{N} (t_k(n) - y_k(n))^2, \qquad (43)$$

while for $N_2$ outputs and one training pattern

$$E = \sum_{k=1}^{N_2} (t_k(n) - y_k(n))^2, \qquad (44)$$

where $y_k(n)$ has the form above. The weights in the output layer are held constant, so all the
$y_k(n)$ are a function of all the weights in the first layer. Because the $y_k(n)$ for all $k$ and for all $n$ are
functions of the same weights, we can use the same technique to update the weights of the two
systems above, given by (43) and (44). Thus, the $N_2$ output network with one training pattern is
treated as a one output network with $N_2$ training patterns. This is done for each training pattern.
Thus if there are $N$ training patterns and $N_2$ outputs, the number of linearized training patterns
is $N \times N_2$. This method allows us to update all the weights in the same layer simultaneously.
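The rearrangement from the nested form (37) to the form (39) that is linear in the first-layer weights can be checked numerically. The sketch below uses arbitrary made-up values for the slopes, offsets, and weights (all variable names are illustrative) and builds the linearized inputs of (42) and the linearized target of (41) for one output node and one pattern.

```python
import numpy as np

rng = np.random.default_rng(3)
N0, N1 = 3, 2                              # number of inputs and of layer-one nodes
x = rng.normal(size=N0)                    # inputs x_l(n)
W1 = rng.normal(size=(N1, N0))             # first-layer weights w'_{j,l}
w2 = rng.normal(size=N1)                   # output-layer weights w_{k,j} for one node k
K1 = rng.uniform(0.1, 0.3, size=N1)        # slopes K'_j(n), made-up values
b1 = rng.normal(size=N1)                   # offsets b'_j(n), made-up values
Kk, bk = 0.2, 0.1                          # slope K_k(n) and offset b_k(n), made-up

# Nested form (37): linearized hidden outputs fed through the linearized output node.
y_hidden = K1 * (W1 @ x) + b1
y_nested = Kk * (w2 @ y_hidden) + bk

# Form (39): linear in the first-layer weights, with inputs (42) and a fixed offset.
x_tilde = Kk * np.outer(w2 * K1, x)        # entry [j, l] is K_k w_{k,j} K'_j x_l
offset = Kk * (w2 @ b1) + bk
y_linear = np.sum(x_tilde * W1) + offset

# Linearized target (41) for a hypothetical desired output t_k(n).
t_k = 1.0
t_tilde = t_k - offset
```

With the output-layer weights held fixed, fitting `t_tilde` with the inputs `x_tilde` is an ordinary linear least squares problem in the `N0 * N1` first-layer weights.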

Ultimately the goal is to update all the weights of the network simultaneously. Some improvement
in convergence can be expected because the weights are not independent.

Simultaneous  updating  of  all  weights  can  easily  be  accomplished  for  a  one  output  network  using 
the  derivation  above.  From  (38) 

$$y_k(n) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} [K_k(n) K'_j(n) x_l(n)]\, w_{k,j}\, w'_{j,l} + \sum_{j=1}^{N_1} [K_k(n) b'_j(n)]\, w_{k,j} + b_k(n). \qquad (45)$$

Letting

$$w''_{j,l} = w_{k,j}\, w'_{j,l}, \qquad (46)$$

then

$$y_k(n) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} [K_k(n) K'_j(n) x_l(n)]\, w''_{j,l} + \sum_{j=1}^{N_1} [K_k(n) b'_j(n)]\, w_{k,j} + b_k(n) \qquad (47)$$

or

$$(y_k(n) - b_k(n)) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} [K_k(n) K'_j(n) x_l(n)]\, w''_{j,l} + \sum_{j=1}^{N_1} [K_k(n) b'_j(n)]\, w_{k,j}. \qquad (48)$$

This is a linear system with one output and $N_0 \times N_1 + N_1$ inputs. The system can be solved for
$w_{k,j}$ and $w''_{j,l}$, and (46) can be used to solve for $w'_{j,l}$.
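A hedged sketch of the network-wise idea for a one-output network: treat the products in (46) and the output-layer weights as independent unknowns of the linear system (48), solve by least squares, and then divide to recover the first-layer weights. The slopes and offsets are random stand-ins for values that would come from the linearization, `np.linalg.lstsq` replaces the QR recursion for brevity, and the division assumes no output-layer weight is zero.

```python
import numpy as np

rng = np.random.default_rng(4)
N0, N1, N = 3, 2, 40                          # inputs, layer-one nodes, patterns
W1_true = rng.normal(size=(N1, N0))           # first-layer weights w'_{j,l}
w2_true = rng.uniform(0.5, 1.5, size=N1)      # output weights w_{k,j}, kept away from 0

rows, rhs = [], []
for _ in range(N):
    x = rng.normal(size=N0)
    K1 = rng.uniform(0.1, 0.3, size=N1)       # stand-in slopes K'_j(n)
    b1 = rng.normal(size=N1)                  # stand-in offsets b'_j(n)
    Kk, bk = rng.uniform(0.1, 0.3), rng.normal()
    y = Kk * (w2_true @ (K1 * (W1_true @ x) + b1)) + bk
    # Regressors of (48): K_k K'_j x_l for every (j, l), then K_k b'_j for every j.
    rows.append(np.concatenate([(Kk * np.outer(K1, x)).ravel(), Kk * b1]))
    rhs.append(y - bk)

sol = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
W_prod = sol[:N0 * N1].reshape(N1, N0)        # estimated products of (46)
w2_hat = sol[N0 * N1:]                        # estimated output-layer weights
W1_hat = W_prod / w2_hat[:, None]             # first-layer weights recovered via (46)
```

In this noise-free setting the recovered weights match the generating ones; in actual training the slopes and offsets change with every relinearization, so the solve would be repeated each iteration.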

5  Experimental  Results 

5.1  Single  Node  Updating 

The  results  given  in  this  section  compare  three  training  strategies  for  an  MLP.  These  are: 

1.  Conventional  back-propagation  (no  linearization). 

2.  Conventional  RLS  with  a  forgetting  factor  (A-S  Algorithm). 

3. QR decomposition with an exponential forgetting factor.

Each  of  the  three  strategies  was  used  to  train  each  of  the  following  networks: 

1.  a  two-bit  parity  checker, 

2.  a  four-bit  parity  checker,  and 

3.  a  four-bit  bit  counter. 

The architectures for these three networks are illustrated in Fig. 3.

The two-bit parity checker (XOR) network has two inputs, two hidden layer nodes and one
output node. An additional input, whose value was always unity, is added at each layer to serve as
a bias for each node. The output function $S(\cdot)$ is the sigmoid defined in (3). The initial weights were
chosen as follows. Each weight in the network was selected randomly from a uniform distribution
over the set $[-1, 1]$. This procedure was repeated 100 times to select 100 sets of initial weights. The
same 100 sets of weights were used for all 3 implementations. For the back-propagation algorithm,
a factor of 0.04 was used in the weight updating equation. The A-S algorithm was implemented [2]
using no weight change constraints. The forgetting factor for this and for the QR decomposition
implementation was 0.98. The QR decomposition implementation used a weight constraint of 0.2.
Thus the weight vector associated with each node was allowed to change at most by 0.2 during
each iteration.

The four-bit parity checker network has four inputs, four hidden layer nodes and one output
node. A bias input is also added to each layer. Two output functions were used. These were the
same sigmoid function as above, and the logic activation function. The logic activation function
is a three piece piecewise linear function. It is zero at zero, has slope one from zero to one, and
slope zero everywhere else. This makes the derivative of $S(\cdot)$ easy to determine everywhere
except at zero and one where it does not exist. This does not pose a problem in implementation
if we let $S'(a) = 1$ if $a \in [0, 1]$ and zero else. Two sets of 100 random initial weights were used for
the three implementations. The first set of weights was random as in the two-bit parity checker,
and the second set of weights was as described by Azimi-Sadjadi et al. in their paper. The A-S
method selects the weights so that the outputs of the network will be between zero and one. This
is done so that the derivative will not be zero and weight updating can take place.

The four-bit bit counter had four inputs, four hidden layer nodes and two outputs. An extra input was added
to each layer. The logic activation function was used as the output function. Two sets of initial
weights, random and as described by Azimi-Sadjadi, were used. The results are shown in Table 1.
The table shows the number of times each implementation found weights that solved the problem
for the 100 initial weight sets.
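As an illustration, the logic activation function and the slope convention described above might be coded as follows. The function names are hypothetical, and the slope is set to one on the closed interval [0, 1], following the convention stated in the text.

```python
import numpy as np

def logic_activation(a):
    """Piecewise linear ramp: 0 below zero, identity on [0, 1], 1 above one."""
    return np.clip(a, 0.0, 1.0)

def logic_activation_slope(a):
    """Slope used in place of the derivative: 1 on [0, 1], 0 elsewhere."""
    a = np.asarray(a, dtype=float)
    return np.where((a >= 0.0) & (a <= 1.0), 1.0, 0.0)

a = np.array([-0.5, 0.0, 0.25, 1.0, 2.0])
values = logic_activation(a)
slopes = logic_activation_slope(a)
```

Once a node's argument leaves the ramp the slope is zero, which is why initial weights that keep the outputs between zero and one allow weight updating to proceed.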

Simulations were also run comparing the output error of each algorithm. In the resulting figures,
the error in dB means the following: Let $\varepsilon(i)$ be the sum of the squared errors incurred in iteration
$i$ through the training patterns, averaged over the 100 initial weight sets. Then, plotted in the
figures is $10 \log(\varepsilon(i)/\mu)$ (dB), where $\mu$ is the maximum possible error in any iteration.
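The plotted quantity can be computed as in the following sketch, which assumes a base-10 logarithm (usual for decibels, though the text does not state the base) and made-up error values. `error_db`, `sq_errors`, and `max_error` are illustrative names.

```python
import numpy as np

def error_db(sq_errors, max_error):
    """Average squared error per iteration over weight sets, in dB relative to max_error."""
    eps = np.mean(sq_errors, axis=0)          # average over the initial weight sets
    return 10.0 * np.log10(eps / max_error)

# Rows: initial weight sets; columns: iterations (made-up numbers).
sq_errors = np.array([[4.0, 0.4],
                      [4.0, 0.4]])
curve = error_db(sq_errors, max_error=4.0)
```

On this scale 0 dB corresponds to the maximum possible error, and each drop of 10 dB is a tenfold reduction in the averaged squared error.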

Fig. 4 shows the errors of the three XOR implementations. Fig. 5 shows the errors of the QR
decomposition algorithm using different forgetting factors and different weight constraints. The
number of convergences for each of the settings was 78 for number one, 60 for number two, 56 for
number three and 64 for number four. It is apparent that the parameters which yield the most
convergences do not necessarily lead to the lowest average error.

These results indicate a clear advantage for the QR decomposition strategy. Algorithmic differences
among the three implementations account for performance differences. One difference is the
initialization needed for the RLS equations. With the RLS strategy, both the covariance matrix
recursion and the weight vector recursion must be initialized using theoretically incorrect values.
Because of these initializations, the RLS algorithm is not guaranteed to move the estimate in the
direction of greatest decrease of $E$, or even of decreasing $E$, for the first iterations. Of the two RLS
recursions, the weight recursion seems to be the most sensitive to the initialization problem. This
is because $P$ is initialized with large values, so $P^{-1}$ is small and the effect of this initialization is
relatively small. The weight recursion is sensitive to initialization because (28) depends explicitly
upon $w_k(i, n-1)$. The QR algorithm has only an implicit dependence on the weights, as do all
linearization algorithms, because the linearizations depend on the weights.

There is also a difference in the performance of the network using different functions for $S(\cdot)$.
The logic activation function proved superior to the sigmoid in these experiments. This is probably
because the error will always be positive using the sigmoid, but can be zero for the threshold logic
activation function. No matter how the weights are adjusted, the output of the sigmoid will always
be bounded by one, so that the training pattern outputs can never be matched exactly. With the
threshold logic activation function, once the weights are adjusted so that the output is off the ramp
(the linear region), the output can be zero or one in which case the difference between the training
output and the actual output can be zero.

5.2  Layer  Updating  and  Network  Updating 

This  section  gives  the  results  for  the  algorithms  given  in  Section  4.  The  first  algorithm  updates 
the  network  weights  by  layers.  All  the  weights  in  the  same  layer  are  updated  simultaneously.  The 
second  algorithm  updates  all  the  weights  in  the  network  simultaneously.  Both  algorithms  were  used 
to  train  a  two-bit  parity  checker  (XOR)  network.  The  same  network  architecture  and  the  same 
100  sets  of  initial  weights  as  in  the  previous  section  were  used  in  the  simulations.  The  layer-wise 
updating algorithm had a forgetting factor of 0.3 and a weight constraint of 1.0. Thus the vector of
the weights in each layer was allowed to change by at most 1.0 during any iteration. The
network-wise updating algorithm had the same forgetting factor and weight constraint. The results are
shown  in  Table  2.  As  in  Table  1,  Table  2  shows  the  number  of  times  each  algorithm  found  weights 
that  solved  the  problem  for  the  100  initial  weight  sets. 

Fig. 6 shows the errors of the two XOR implementations of this section and the error of the
QR decomposition implementation of the single node updating algorithm. The convergence results
show the advantage of layer-wise weight updating and network-wise weight updating over node-wise
updating. Layer-wise weight updating also proved better in the error analysis.

6 Conclusions


A new implementation of a node-wise weight updating algorithm for multilayer perceptrons and new
algorithms that update weights layer-wise and network-wise have been presented in this paper. The
QR decomposition implementation has been shown to be superior to standard recursive equations
for the node-wise updating algorithm. This result should prove to be beneficial not only for this
algorithm, but for all MLP training algorithms that use recursive equations for implementation. The
layer-wise and network-wise weight updating algorithms were developed to improve the convergence
rate and the speed of convergence. Both objectives were accomplished, with the layer-wise weight
updating algorithm showing a significant advantage over both the single node weight updating
algorithm used as a reference and standard back-propagation.

Tables 1 & 2


Table 1: Number of convergences per 100 sets of initial weights

                  2in-1out   4in-1out                                  4in-2out
                  sigmoid    sigmoid              logic activation     logic activation
Implementation    random     random     A-S       random     A-S       random     A-S
                  weights    weights    weights   weights    weights   weights    weights
Q-R               78         5          5         51         57        1          16
Back-Prop         11         0          1         53         IQUI
A-S               8          0          0         1          37        0
Table 2: Number of convergences per 100 sets of initial weights

Implementation           2in-1out, sigmoid, random weights
Layer-wise updating      96
Network-wise updating    99
Figure 2: Weight Estimation Using Recursive QR Decomposition

Initialization: Initialize an $(M+1) \times (M+1)$ working matrix, say $A$, to a null matrix.

Recursion: For $i = 1, 2, \ldots$ (iteration); and, for $n = 1, 2, \ldots, N$ (pattern),

1. Enter the next equation into the bottom row of $A$,

$$\big[\, [y'(n)]^T \;\big|\; t_k(n) \,\big]. \qquad (49)$$

2. “Rotate” the new equation into the system using

$$A'_{mk} = A_{mk}\,\sigma + A_{M+1,k}\,\tau\,\delta$$
$$A'_{M+1,k} = -A_{mk}\,\tau\,\delta + A_{M+1,k}\,\sigma\,\delta$$

for $k = m, m+1, \ldots, M+1$ and $m = 1, \ldots, M$; where $\sigma = A_{mm}/\rho$, $\tau = A_{M+1,m}/\rho$,
$\rho = (A_{mm}^2 + \delta A_{M+1,m}^2)^{1/2}$, $\delta$ is unity (useful later), and $A_{mk}$ ($A'_{mk}$) is the $m,k$ element of $A$
pre- (post-)rotation. No other elements of $A$ are affected.

3. Solve for the least square estimate of the weights $w_k$ if desired. (Solution after the $n$th pattern
will produce what has been called $w_k(i,n)$ in the text, and $w_k(i,N) = w_k(i)$.)

4. If $n < N$, increment $n$. Otherwise check convergence criterion and increment $i$ and reset $n$ if
not met.

Termination: Stop when some convergence criterion has been met.
On the Connections Between the Fogel-Huang and Dasgupta-Huang
OBE Identification Algorithms

Abstract

The Dasgupta-Huang Optimal Bounding Ellipsoid (OBE) algorithm for identifying linear parametric systems has
been proven to converge under ordinary conditions on the model disturbances. This appealing property
notwithstanding, the algorithm is based upon an unusual optimization criterion which makes behavior of the method difficult
to interpret theoretically. On the other hand, the optimization strategy for the original Fogel-Huang OBE algorithm
is appealing in its straightforward interpretability, but the method suffers from the lack of a clear understanding
of its convergence properties. While the underlying bounded error assumption gives rise to both algorithms, the
developments of the techniques are fundamentally very different. However, this note describes some interesting
relationships between the algorithms which: 1. provide theoretical support and interpretation of the optimization
criterion employed in Dasgupta-Huang; and, 2. suggest that an algorithm with the desirable properties of both
algorithms may exist.

1 Introduction


Set-membership-based (SM) system identification algorithms offer an interesting alternative to conventional
techniques. SM methods have been receiving increasing attention internationally. Recent reviews of this field are found,
for example, in [1]-[3]. This note is restricted to the class of algorithms known as optimal bounding ellipsoid (OBE)
algorithms which follow from a bounded error constraint. We explore some interesting connections, which have not
been well appreciated, between two landmark OBE algorithms: the Fogel-Huang (F-H) [4] and Dasgupta-Huang (D-H) [5]
OBE algorithms. These connections suggest the possibility that the desirable properties of
both may be blended into a single OBE algorithm.

The bounded error identification problem is as follows: Assume that we are observing some physical system
which is generating sequence $y(\cdot) \in \mathbb{C}^k$ in response to input $u(\cdot) \in \mathbb{C}^l$. $u(\cdot)$ is a realization of an ergodic, wide sense
stationary stochastic process. Both input and output sequences are measurable. We assume the existence of a “true”
model of form

$$y(n) = \Theta_*^H x(n) + \varepsilon_*(n) \qquad (1)$$

in which $x(n)$ is some $m$-vector of functions of $p$ lags of $y(\cdot)$ and $q$ lags plus the present value of $u(\cdot)$, and where
$\varepsilon_*(\cdot)$ is the realization of a zero-mean, second moment ergodic, complex vector-valued random sequence
whose components are independent. The matrix $\Theta_* \in \mathbb{C}^{m \times k}$ parameterizes the model. At time $n$ we wish to use the
observed data on $t \in [1, n]$ to deduce an estimated model of the same form. The parameter estimate is denoted by
$\Theta(n)$ and the residual process by $\varepsilon(\cdot, \Theta(n))$. The dependence of the residual upon the parameter estimates is highly
significant, so it is shown explicitly.

Deller et al. [3] have recently shown that all reported OBE algorithms, including F-H OBE and D-H OBE, can
be unified into a general framework which they call the Unified OBE (UOBE) algorithm. We initially present the
UOBE framework: UOBE algorithms arise from a bounded error constraint:

$$\|\varepsilon_*(n)\|^2 < \gamma(n), \qquad (2)$$

where $\gamma(\cdot)$ is a known positive sequence. At time $n$, a set of parameters, say $\Omega(n)$, can be found which are consistent with the
observations and this sequence of bounds. The exact set is difficult to describe and track, but, in conjunction with
weighted recursive least square (WRLS) processing (e.g., [6, 7]), $\Omega(n)$ can be shown to be contained in a superset of
the form (e.g., [3], [4], [8])

$$\bar{\Omega}(n) = \left\{ \Theta \;\middle|\; \mathrm{tr}\left\{ [\Theta - \Theta(n)]^H \frac{C(n)}{\kappa(n)} [\Theta - \Theta(n)] \right\} < 1 \right\} \qquad (3)$$

where $\mathrm{tr}\{\cdot\}$ denotes the trace of a matrix, $\Theta(n)$ is the WRLS parameter estimate at time $n$ using weights $\lambda_n(1), \ldots, \lambda_n(n)$,
$C(n)$ is the weighted covariance matrix, and $\kappa(n)$ is the scalar quantity

$$\kappa(n) = \mathrm{tr}\{\Theta^H(n) C(n) \Theta(n)\} + \sum_{t=1}^{n} \lambda_n(t) \left[\gamma(t) - \|y(t)\|^2\right]. \qquad (4)$$

$\bar{\Omega}(n)$ is a hyperellipsoid in $\mathbb{C}^{m \times k}$, with its center at $\Theta(n)$. By examining a single output, say $y_i(\cdot)$, the $i$th component
of $y(\cdot)$, we see that a common “ellipsoid matrix” $C(n)/\kappa(n)$ is shared by each of the individual outputs, but that
each is centered on a different parameter estimate represented by column $i$ of $\Theta(\cdot)$. We conclude therefore that under
bounded error constraints, a hyperellipsoid can be associated with a WRLS recursion and conversely.
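A small sketch of the membership test implied by (3), assuming the center, the matrix C(n), and the scalar kappa(n) are already available from a WRLS recursion. The function name `in_ellipsoid` and the numerical values are illustrative only.

```python
import numpy as np

def in_ellipsoid(theta, center, C, kappa):
    """True if tr{(theta - center)^H (C / kappa) (theta - center)} < 1."""
    D = np.asarray(theta - center).reshape(C.shape[0], -1)   # m x k difference
    value = np.trace(D.conj().T @ C @ D).real / kappa
    return bool(value < 1.0)

# Single-output, real-valued example with m = 2 parameters (made-up numbers).
C = np.diag([4.0, 1.0])
center = np.zeros(2)
inside = in_ellipsoid(np.array([0.4, 0.5]), center, C, kappa=1.0)    # quadratic form 0.89
outside = in_ellipsoid(np.array([0.5, 0.5]), center, C, kappa=1.0)   # quadratic form 1.25
```

For several outputs the same function accepts an m x k parameter matrix, in which case the trace sums the quadratic forms of the individual columns, each measured with the common matrix C(n)/kappa(n).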
The subscript “n” on the weights $\lambda_n(\cdot)$ is used to indicate that the weights may be dependent upon the time
of estimation. In general, time dependent weights are not easily integrated into WRLS algorithms except in simple
cases. One such case occurs in the UOBE algorithm in which the weights are time varying by virtue of a scaling
procedure. The weights used at time $n$ are given by

$$\lambda_n(t) = \frac{\lambda_{n-1}(t)}{\zeta(n-1)} \quad \text{for } t \le n-1 \qquad (5)$$

and $\lambda_n(n)$, where $\zeta(\cdot)$ is a positive scaling sequence. We make the reasonable assumption that the sequence $\zeta(\cdot)$
is “causal” in the sense that $\zeta(n)$ does not depend upon any quantities not available at time $n$. The method for
integrating scaled weights into WRLS is given inherently in [4] and [8], and explicitly in [3] and [9]. While the weights
are directly related to the size, orientation, and location of the ellipsoid in the parameter space, this scaling procedure
effectively restricts to one (viz. $\lambda_n(n)$) the number of free parameters available to control the bounding ellipsoid at
time $n$. The central objective of the UOBE algorithm is to employ the weights in the context of WRLS estimation
to sequentially minimize the ellipsoid size in some sense. A significant benefit is that often no weight exists which
can minimize the ellipsoid, indicating that the incoming data set is uninformative in the SM sense.

All UOBE algorithms adhere to the following steps: At time $n$,

1. In conjunction with the incoming data set $(y(n), x(n))$, find the weight, say $\lambda^*(n)$, which is optimal in some
sense (see below).

2. Discard the data set if $\lambda^*(n) \le 0$.

3. Update $C(n)$ and $\Theta(n)$ using some version of WRLS (e.g., see [8]).

4. Update $\kappa(n)$ using (4) or one of the recursions in [3].

Three fundamental variations on the UOBE method have been reported in the literature. The most recent, the
D-H OBE algorithm [5], is unlike the others in one important aspect. This difference lies in the criterion used for
determining optimal weights. This difference, on one hand, allows for a proof of convergence of the ellipsoid in a
certain sense. On the other hand, the optimization criterion used is controversial and somewhat difficult to interpret.
Further, the unusual optimization criterion so profoundly changes the development of the algorithm that its identity
as a member of the UOBE class of algorithms has not been appreciated.

The other two reported OBE methods are the F-H OBE algorithm [4] ($\zeta(n) = \kappa(n)$) and the SM-WRLS algorithm
of Deller et al. [8], [10] ($\zeta(n) = 1$). Variations on, and enhancements to, each of these algorithms, as well as D-H
OBE, are found in the literature (e.g., [11]-[14]). The stated purpose of this paper is to make connections between
D-H OBE and F-H OBE. However, the important contrast exists between D-H OBE and any UOBE algorithm with
“conventional optimization” based on a meaningful set measure as described below. Let us refer to the latter class
of algorithms as UOBE-$\mu$, and generalize the discussion.

UOBE-$\mu$ algorithms operate on the optimization principle of (prospectively) minimizing some set measure of $\Omega(n)$, say $\mu\{\Omega(n)\}$. Fogel and Huang [4] suggest two set measures. The first is the determinant of the inverse ellipsoid matrix,

$$\mu_v\{\Omega(n)\} \stackrel{\mathrm{def}}{=} \det\{\kappa(n) C^{-1}(n)\} \qquad (6)$$

and the second is the trace,

$$\mu_t\{\Omega(n)\} \stackrel{\mathrm{def}}{=} \mathrm{tr}\{\kappa(n) C^{-1}(n)\}. \qquad (7)$$

(We shall henceforth write $\mu_v(n)$ and $\mu_t(n)$ for simplicity.) In the single output case in which $\Omega(n)$ is clearly interpretable as an ellipsoid, $\mu_v(n)$ is proportional to the square of the volume of the ellipsoid, while $\mu_t(n)$ is proportional to the sum of squares of its semi-axes. The same two measures are meaningful in the multiple output case, since they result in the minimization of the volume or trace of the common ellipsoid shared by all the outputs (see discussion below (4)).
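
As a concrete check of the geometric interpretation of (6) and (7), the following Python sketch computes both measures for a real, two-parameter, single-output ellipse $\{\theta : (\theta - \theta_c)^T C (\theta - \theta_c) \le \kappa\}$. The function name `set_measures` and the numerical values are illustrative assumptions.

```python
import numpy as np

def set_measures(C, kappa):
    """Return (mu_v, mu_t): determinant and trace of kappa * C^{-1}."""
    M = kappa * np.linalg.inv(C)
    return np.linalg.det(M), np.trace(M)

# Hypothetical example: an axis-aligned ellipse with semi-axes 2 and 0.5.
kappa = 1.0
C = np.diag([1 / 2.0**2, 1 / 0.5**2])
mu_v, mu_t = set_measures(C, kappa)

# The semi-axes are the square roots of the eigenvalues of kappa * C^{-1}.
semi_axes = np.sqrt(np.linalg.eigvalsh(kappa * np.linalg.inv(C)))
area = np.pi * np.prod(semi_axes)   # "volume" of a 2-D ellipse
```

In this example $\mu_v$ equals $(\text{area}/\pi)^2$ and $\mu_t$ equals the sum of the squared semi-axes, which is the proportionality described above.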

The general method for finding the UOBE-$\mu$ optimal weight for minimizing either set measure is given in [4]. These methods include results for F-H OBE and SM-WRLS as special cases, but optimization strategies are also given, of course, in the original papers. It is found that $\lambda^*(n)$ is the unique positive root of the polynomials $F_v(\lambda)$ and $F_t(\lambda)$ for the volume and trace measures, respectively, where $F_v$ is a quadratic,

$$F_v(\lambda) = a_2 \lambda^2 + a_1 \lambda + a_0, \qquad (8)$$

and $F_t$ is a cubic polynomial,

$$F_t(\lambda) = b_3 \lambda^3 + b_2 \lambda^2 + b_1 \lambda + b_0. \qquad (9)$$

The coefficients $a_i$ and $b_i$ are given in terms of quantities which are known prior to time $n$, and which, in turn, are dependent upon the scaling sequence $\zeta(\cdot)$. The interesting feature of the UOBE-$\mu$ algorithms is the infrequent existence of the optimal weight, leading to infrequent updating of the parameter estimates. This reduction in the need for updating, in turn, results in computational efficiencies and interesting performance properties.

In contrast, the D-H OBE algorithm uses scale factors $\zeta(n) = (1 - \lambda^*(n))^{-1}$, and an optimization procedure which does not seek to directly minimize a set measure on $\Omega(n)$ such as (6) or (7). Rather, the weight is chosen to minimize $\kappa(n)$, subject to the constraint that it be in the allowable range $[0, \alpha]$ with $0 < \alpha < 1$ (see below). The choice of scaling sequence results in the covariance matrix at time $n$,

$$C(n) = (1 - \lambda^*(n)) C(n-1) + \lambda^*(n) x(n) x^H(n) \qquad (10)$$

which is seen to be a convex combination of $C(n-1)$ and the new data outer product. Here we see the reason for the constraint on the range of optimal weights. This construction provides the means with which to prove asymptotic and exponential convergence of the ellipsoid, and cessation of updating, using Lyapunov theory. Upon convergence, the residuals $\varepsilon(\cdot, \Theta(\cdot))$ are guaranteed to remain in the "dead zone" indicated by the error bounds, i.e., $\lim_{n \to \infty} \|\varepsilon(t, \Theta(n))\|^2 \le \gamma(t)$.

Dasgupta and Huang [5] show that such an optimal weight, in the sense of minimizing $\kappa(n)$, exists iff

$$\|\varepsilon(n, \Theta(n-1))\|^2 > \gamma(n) - \kappa_s(n-1), \qquad (11)$$

where $\kappa_s(n-1) = \kappa(n-1)/\zeta(n-1)$ is the "scaled" value of the $\kappa$ parameter. Accordingly, this simple and computationally inexpensive test may be employed to determine whether the current data set $(y(n), x(n))$ is useful in the sense of the optimization criterion. However, whether this goal of minimizing $\kappa(n)$ is meaningful remains an issue of controversy. From an analytical point of view, the reason for this choice is that $\kappa(n)$ is a bound on the Lyapunov function used in the minimization, and the convergence of the Lyapunov function is used to prove convergence of the algorithm. From an interpretive point of view, however, diminishing $\kappa(n)$ is not helpful because its magnitude is not clearly related to the "size" of the set $\Omega(n)$. Dasgupta and Huang [5] argue simply that $\kappa(n)$ is "a bound on the estimation error," and should be minimized. Norton and Mo [15] dispute this claim with the observation that "[minimizing $\kappa(n)$] is claimed in [5] to minimize a bound on the estimation error, but the quantity minimized is not a bound on the parameter error, nor does it bear a simple relation to it."

Hence, we have arrived at the apparent philosophical and practical dilemma which initiated this discussion. When faced with the choice of OBE algorithms, does one opt for the D-H OBE method with its proven convergence properties, or the UOBE-$\mu$ (including F-H OBE) algorithms with their clear interpretation? In the following section, we demonstrate some heretofore unrecognized connections between the methods which provide a better basis for making this choice.

2 Connections Between the D-H OBE and UOBE-$\mu$ (F-H OBE) Algorithms

In spite of some statements to the contrary in the literature (based on apparent misunderstandings of the original F-H paper [4]), there is no known proof of convergence of the F-H OBE algorithm according to any reasonable criteria. In particular, F-H OBE is not known to converge in the sense described for D-H OBE. These statements apply to UOBE-$\mu$ algorithms in general. We shall not pursue such a convergence result. Rather, we shall show that some of the "interpretability" of UOBE-$\mu$ algorithms may be "transferred" to D-H OBE.

It is somewhat curious that we have called the D-H method an "OBE" algorithm. The bounding ellipsoid clearly underlies the process, but its use in the optimization procedure is obscure. Herein lies the crux of the problem with interpretation. While it is not exploited in the D-H OBE algorithm, the D-H hyperellipsoid nevertheless does have volume and trace set measures at each $n$. In fact, because D-H OBE is fundamentally a UOBE algorithm, the unique positive root of (8), if it exists, will minimize $\mu_v$. A similar statement applies to (9) and $\mu_t$. The utility of this volume or trace result remains an open question at this point in our discussion, because to use weights which are optimal in these "conventional" senses does not necessarily admit the convergence results obtained by the D-H analysis.

The connection between the two methods rests fundamentally in the zero order coefficients $a_0$ and $b_0$ of the optimization polynomials (8) and (9). The following result has been shown in [3]:

Theorem 1 The quadratic $F_v(\lambda)$ associated with the UOBE-$\mu$ algorithms has a unique positive root (the optimal weight, $\lambda^*(n)$) iff $a_0 < 0$. Similarly, the cubic equation $F_t(\lambda)$ has a unique positive root iff $b_0 < 0$.
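
Theorem 1 suggests checking the sign of the zero order coefficient before solving for the root. A minimal Python sketch for the volume case follows. It assumes a real quadratic with positive leading coefficient $a_2$ (an assumption not stated in the text above), and the function name `optimal_volume_weight` is hypothetical.

```python
import math

def optimal_volume_weight(a2, a1, a0):
    """Positive root of a2*lam**2 + a1*lam + a0, or None when a0 >= 0.

    Assumes a2 > 0, so that a0 < 0 forces two real roots of opposite sign.
    """
    if a2 <= 0:
        raise ValueError("this sketch assumes a2 > 0")
    if a0 >= 0:
        return None                      # no optimal weight: discard the data
    disc = a1 * a1 - 4.0 * a2 * a0       # strictly positive because a2*a0 < 0
    return (-a1 + math.sqrt(disc)) / (2.0 * a2)
```

The sign test costs one comparison, so the square root is evaluated only for data that will actually be used.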

Let us henceforth restrict our attention to the volume minimization case with the understanding that a parallel discussion applies to the trace measure. The coefficient $a_0$ is given by [5]

$$a_0 = mk\left[\gamma(n) - \|\varepsilon(n, \Theta(n-1))\|^2\right] - \kappa_s(n-1) G_s(n) \qquad (12)$$

where $G_s(n) \stackrel{\mathrm{def}}{=} x^H(n) C_s^{-1}(n-1) x(n)$, $C_s(n) \stackrel{\mathrm{def}}{=} C(n)/\zeta(n)$, and all other quantities have been defined above. Noting that the computation of this quantity is nearly as computationally expensive as simply including the new data in the estimate, Deller and Odeh [12],[14] suggest the suboptimal testing procedure in which the new data are used iff

$$\|\varepsilon(n, \Theta(n-1))\|^2 > \gamma(n). \qquad (13)$$

While originally developed using a different argument, it is seen that the test (13) has a very useful interpretation in terms of the "proper" test $a_0 < 0$. In fact, (13) is equivalent to testing whether

$$a_0 + K_1 < 0 \qquad (14)$$

where $K_1 = \kappa_s(n-1) G_s(n)$ is the last term in (12). Since $K_1 \ge 0$ [3], an optimal weight in the sense of diminishing $\mu_v(n)$ will always exist if the suboptimal test (13) is satisfied. Experimental studies have shown that the UOBE-$\mu$ algorithm with suboptimal testing performs as well as the optimal algorithm in terms of tracking and (empirical) convergence, while frequently using significantly fewer data (e.g. see [12],[13]).

Let us now examine the test employed by Dasgupta and Huang, given in (11). Recall that this test is designed to determine whether an optimal weight exists in the sense of minimizing $\kappa(n)$, say $\lambda^*_\kappa(n)$. However, in light of the developments above, the D-H test may also be seen to be a suboptimal test for the existence of an optimal weight in the sense of diminishing $\mu_v$, say $\lambda^*_v(n)$. In fact, the D-H test is equivalent to testing whether

$$a_0 + K_2 < 0 \qquad (15)$$

where $K_2 = \kappa_s(n-1)\left[(G_s(n)/mk) - 1\right]$. Unfortunately, the truth of (15) is not sufficient to assure that $a_0 < 0$ because $K_2$ is not necessarily positive. If it were additionally known that $G_s(n) > mk$, then (15) would be a sufficient test. However, because of the weighting strategy used in D-H OBE, there is no reason to believe that this latter condition holds in general. So the D-H test comes intriguingly close to being a check for the existence of $\lambda^*_v(n)$, but falls somewhat short. Nevertheless, the D-H test can be interpreted as a suboptimal test for the existence of an optimal "volume weight."

To summarize the result above, at time $n$, there is an optimal weight in the sense of minimizing $\mu_v(n)$, $\lambda^*_v(n) \ge 0$, associated with the D-H ellipsoid. The test (11) is a suboptimal check for the existence of a positive value of $\lambda^*_v(n)$. The relevance of this result for D-H OBE, in which $\kappa$, not $\mu_v$, is minimized, is as follows. While generally $\lambda^*_\kappa(n) \ne \lambda^*_v(n)$, roughly speaking, $\lambda^*_\kappa(n)$ will be positive, and the data used, only when the volume can be minimized. While $\lambda^*_\kappa(n)$ is not designed to minimize $\mu_v$, it will still diminish the volume, though not optimally. In turn, this fact is a consequence of the following, which can be inferred from the work in [3]:

Theorem 2 For any UOBE algorithm, if $\lambda^*_v(n) > 0$, then if any positive weight is used, $\mu_v(n) \le \mu_v(n-1)$.

Consequently, to the extent that (11) is a useful test for a positive $\lambda^*_v(n)$, it can be stated that the D-H OBE algorithm diminishes the volume at each step in the process of minimizing $\kappa$.
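
The relationship between the existence condition $a_0 < 0$, the suboptimal test (13), and the D-H test (11) can be illustrated numerically. The Python sketch below assumes the zero order coefficient has the form $a_0 = mk[\gamma(n) - \|\varepsilon\|^2] - \kappa_s(n-1) G_s(n)$ as read from (12). The function names and the numerical values are illustrative assumptions.

```python
def a0_coefficient(m, k, gamma, eps_sq, kappa_s, G_s):
    """Zero order coefficient of the volume quadratic, as read from (12)."""
    return m * k * (gamma - eps_sq) - kappa_s * G_s

def suboptimal_test(gamma, eps_sq):
    """Test (13): use the data iff the squared residual exceeds the bound."""
    return eps_sq > gamma

def dh_test(gamma, eps_sq, kappa_s):
    """Test (11): the D-H check for a kappa-minimizing weight."""
    return eps_sq > gamma - kappa_s

# Hypothetical values with G_s < m*k: the D-H test passes, yet a0 > 0,
# so no positive volume-optimal weight exists for this datum.
m, k, gamma, eps_sq, kappa_s, G_s = 2, 1, 1.0, 0.5, 1.0, 0.1
dh_passes = dh_test(gamma, eps_sq, kappa_s)
a0_value = a0_coefficient(m, k, gamma, eps_sq, kappa_s, G_s)
```

With nonnegative $\kappa_s$ and $G_s$, any datum that passes `suboptimal_test` necessarily has a negative `a0_coefficient`, whereas the example values show that passing `dh_test` carries no such guarantee.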

The arguments above are not rigorous because (11) is not an exact check for a positive $\lambda^*_v(n)$. Further, algorithms which use different weights can only be compared "locally," that is, at a given $n$. However, it seems intuitive that if an algorithm which suboptimally diminishes volume at each $n$ can be shown to converge, then a convergent algorithm which optimally diminishes $\mu_v$ could be demonstrated. By showing that D-H OBE and UOBE-$\mu$ (F-H OBE) are more similar than had previously been understood, this discussion suggests that such an algorithm may be forthcoming.

ABSTRACT

A newly modified set-membership algorithm is introduced. It is shown that the forgetting covariance updating in conjunction with the minimum volume data selection strategy results in a landmark performance level in system identification. A suboptimal test for data selection is introduced which is computationally efficient.

1. INTRODUCTION


The behavior of optimal bounding ellipsoid (OBE) algorithms such as Fogel-Huang and SM-WRLS has attracted the attention of prominent researchers. While the strong performance of these algorithms cannot be disputed, there are no supporting theoretical proofs for their converging behavior. Dasgupta and Huang [4] introduced an OBE algorithm whose convergence was shown theoretically by employing a Lyapunov technique. However, the data-selection strategy implemented in this algorithm is the center of some debate. This originates from a controversial criterion used to measure the performance of the algorithm if new data are to be selected.

In this paper, we introduce a modified SM-WRLS algorithm whose convergence will be proven in the most general system identification setting. This algorithm employs the covariance updating used by Dasgupta and Huang, and selects the incoming data according to the SM-WRLS volume minimization strategy.


2. MODIFIED SM-WRLS ALGORITHM


Since the Modified SM-WRLS algorithm is also an OBE algorithm, it adheres to the general steps outlined in [8]. To address the update recursions and data-selection strategy, let us adopt notation similar to that used in [8].


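Update Recursions: The modification of the SM-WRLS covariance updating to

$$C(n) = (1 - \lambda_n) C(n-1) + \lambda_n x_n x_n^T \qquad (1)$$

with $0 < \lambda_n < 1$, has a profound effect on other measures employed in SM-WRLS as indicated below, where $P(n) = C^{-1}(n)$:

$$P(n) = \frac{1}{1 - \lambda_n} \left[ P(n-1) - \frac{\lambda_n P(n-1) x_n x_n^T P(n-1)}{1 - \lambda_n + \lambda_n G_n} \right]$$

$$\theta_n = \theta_{n-1} + \lambda_n P(n) x_n \varepsilon_n$$

$$\varepsilon_n = y_n - \theta_{n-1}^T x_n \qquad (2)$$

$$\kappa_n = (1 - \lambda_n) \kappa_{n-1} + \lambda_n \gamma_n - \frac{\lambda_n (1 - \lambda_n) \varepsilon_n^2}{1 - \lambda_n + \lambda_n G_n}$$

$$G_n = x_n^T P(n-1) x_n$$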
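
The update recursions for the forgetting-type covariance update can be sketched in Python as follows. The sketch assumes real, single-output data and takes the weight `lam` as an input, since the weight selection is a separate step; the function name `modified_update` is hypothetical, and the formulas follow from applying the matrix inversion lemma to $C(n) = (1 - \lambda_n) C(n-1) + \lambda_n x_n x_n^T$ rather than from a verified transcription of the original equations.

```python
import numpy as np

def modified_update(theta, P, kappa, x, y, gamma, lam):
    """One update of the forgetting-type recursions for a weight 0 < lam < 1."""
    eps = y - theta @ x                      # residual with the old estimate
    G = x @ P @ x                            # G_n = x^T P(n-1) x
    denom = 1.0 - lam + lam * G
    # P(n) is the inverse of (1 - lam) * C(n-1) + lam * x x^T
    P_new = (P - lam * np.outer(P @ x, x @ P) / denom) / (1.0 - lam)
    theta_new = theta + lam * (P_new @ x) * eps
    kappa_new = ((1.0 - lam) * kappa + lam * gamma
                 - lam * (1.0 - lam) * eps**2 / denom)
    return theta_new, P_new, kappa_new
```

If the true parameter vector lies in the ellipsoid $\{\theta : (\theta - \theta_{n-1})^T C(n-1) (\theta - \theta_{n-1}) \le \kappa_{n-1}\}$ and the noise satisfies the bound $\gamma_n$, then it also lies in the updated ellipsoid for any weight in $(0, 1)$, which gives a simple numerical check of an implementation.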


Data Selection: Selection of the incoming data involves the minimization of the bounding ellipsoid's volume with respect to the weights $\lambda_n$. The optimal weight, denoted by $\lambda_n^*$, is given by

$$\lambda_n^* = \max(0, \lambda_{no}) \qquad (3)$$

where $\lambda_{no}$ is the larger root of the following quadratic equation:

$$\alpha_2 \lambda^2 + \alpha_1 \lambda + \alpha_0 = 0 \qquad (4)$$


where 

a,  =  my.  -  me;  -s  -  2mG,Y„  -  K„_|G'„  +  k,_,G„-  G'„y„  -  GVy^  -  e;G„ 

oti  “  ^mEfj  —  2m  —  2/uG„ Yfi  ^  ~~  k„_iG,]  —  G„  {„  +  e„G„ 

CL,  =  riiY,  -  me;  - 


Theorem  1:  Equation  (4)  has  at  most  one  positive  root  in  which  case  the  selection  of  the  data  point 
guarantees  the  shrinkage  of  the  bounding  ellipsoid  in  volume.  Moreover,  this  positive  root  lies  in  the 
interval  (0,1). 

This result implies that the data must be discarded if

$$\frac{\alpha_0}{\alpha_2} > 0$$

Also, noting that (1) represents a convex combination of the $C(n-1)$ and $x_n x_n^T$ matrices, Theorem 1 suggests that as long as a positive root to (4) is found, no extra monitoring is needed to satisfy the $0 < \lambda_n < 1$ condition. This is a very convenient data selection strategy.


3. CONVERGENCE OF THE ALGORITHM

This modified SM-WRLS algorithm shows an attractive convergence behavior as given by the following theorem.

Theorem 2: Let us assume that the noise process $v(n)$ is persistently exciting with pointwise energy bound

$$v^2(n) < \gamma_n \qquad (5)$$

Then, the modified SM-WRLS shows convergence in the following sense:

(1) $\lim_{n \to \infty} \|\theta_n - \theta^*\| = 0$

(2) $\lim_{n \to \infty} \lambda_n^* = 0$

(3) $\lim_{n \to \infty} \varepsilon_n^2 \in [0, \gamma_n]$


Example:

Let us consider a simple AR(2) model given by

$$y(n) = a_1 y(n-1) + a_2 y(n-2) + v(n)$$

where $a_1 = 1.6$ and $a_2 = -0.68$, and $v(n)$ is a white noise sequence where $\gamma_n = 0.5$ in (5). Figure 1 clearly indicates the asymptotic convergence of the parameters to the desired values. Also shown in Fig. 2 is the volume of the bounding ellipsoid as more data are selected. The asymptotic convergence of the optimal weight to zero is shown in Fig. 3. Figure 4 indicates that $\varepsilon_n$ in the limit will be equivalent to the input noise $v(n)$, as expected.


4. SUBOPTIMAL TEST

There are some common features of SM-WRLS that are inherited by the modified version, one of which is the feasibility of the same suboptimal test shown in [3] and [8]. This is so because of the resemblance of the $\alpha_0$ coefficient in (4) to its counterpart in conventional SM-WRLS. Figure 5 shows the relative computational requirements for the optimal and suboptimal tests associated with this algorithm.

Fig. 4. $\varepsilon_n^2$ vs. $n$ for the identification of the AR(2) model in the example.

Abstract¹

This paper is concerned with the convergence and bias properties of a general class of optimal bounding ellipsoid (OBE) algorithms. OBE algorithms are set-membership (SM) based identification algorithms which are applied to models which are linear-in-parameters, and are closely related to weighted recursive least square error (WRLS) methods.

1. Introduction and Formalities

This paper will generalize the basic OBE identification problem with respect to existing publications in several ways. First, a complex signal, MIMO system is considered. This treatment subsumes common parametric models such as the SISO ARX model (e.g. [*]) as special cases. Secondly, a unified OBE (UOBE) algorithm is developed which contains all reported OBE algorithms, both adaptive and nonadaptive, as special cases.

Assume that we are observing some physical system which is generating sequence $y(\cdot) \in \mathbb{C}^k$ in response to input $u(\cdot)$. $u(\cdot)$ is a realization of an ergodic, wide sense stationary stochastic process. Both input and output sequences are measurable. We assume the existence of a "true" model of form

$$y(n) = \Theta_*^H x(n) + \varepsilon_*(n) \qquad (1)$$

in which $x(n)$ is some $m$-vector of functions of $p$ lags of $y(\cdot)$ and $i$ lags plus the present value of $u(\cdot)$, and where $\varepsilon_*(\cdot) \in \mathbb{C}^k$ is the realization of a zero-mean, second moment ergodic, complex vector-valued random sequence whose components are independent. The matrix $\Theta_* \in \mathbb{C}^{m \times k}$ parameterizes the model. At time $n$ we wish to use the observed data on $t \in [1, n]$ to deduce an estimated model of the same form. The parameter estimate is denoted by $\Theta(n)$ and the residual process by $\varepsilon(n, \Theta(n))$. The dependence of the residual upon the parameter estimates is highly significant, so it is shown explicitly.

In almost all SM-based techniques, a feasible parameter set arises from direct or indirect constraints on the additive error sequence. UOBE algorithms arise from a bounded error constraint:

$$\|\varepsilon_*(n)\|^2 < \gamma(n), \qquad (2)$$

where $\gamma(\cdot)$ is a known positive sequence. At time $n$, a set of parameters can be found which are consistent with the observations and this sequence of bounds. The exact set is difficult to describe and track, but, in conjunction with WRLS processing, it can be shown to be contained in a superset of the form

$$\Omega(n) = \left\{ \Theta \;\middle|\; \mathrm{tr}\left\{ [\Theta - \Theta(n)]^H \frac{C(n)}{\kappa(n)} [\Theta - \Theta(n)] \right\} < 1 \right\} \qquad (3)$$

where $\mathrm{tr}\{\cdot\}$ denotes the trace of a matrix, $\Theta(n)$ is the WRLS parameter estimate at time $n$ using weights $\lambda_n(1), \ldots, \lambda_n(n)$, $C(n)$ is the weighted covariance matrix, and $\kappa(n)$ is the scalar quantity

$$\kappa(n) = K(n) + \sum_{t=1}^{n} \lambda_n(t) \left[ \gamma(t) - \|y(t)\|^2 \right] \qquad (4)$$

with $K(n) \stackrel{\mathrm{def}}{=} \mathrm{tr}\{\Theta^H(n) C(n) \Theta(n)\}$. We shall refer to $\Omega(n)$ as a "hyperellipsoid" in $\mathbb{C}^{m \times k}$ with its "center" at $\Theta(n)$. Indeed, if all quantities are real, and $m = 2$ and $k = 1$, this set forms an ellipse in $\mathbb{R}^2$. By examining a single output, say $y_i(\cdot)$, the $i$-th component of $y(\cdot)$, we see that a common "ellipsoid matrix" $C(n)/\kappa(n)$ is shared by each of the individual outputs, but that each is centered on a different parameter estimate represented by column $i$ of $\Theta(\cdot)$. We conclude therefore that under bounded error constraints, a hyperellipsoid can be associated with a WRLS recursion and conversely.

¹ This work was supported by the Office of Naval Research under Contract No. N00014-91-J-1329 and by the National Science Foundation under Grant No. MIP-9016734.

The subscript "n" on the weights $\lambda_n(\cdot)$ is used to indicate that the weights may be dependent upon the time of estimation. In general, time dependent weights are not easily integrated into WRLS algorithms except in simple cases. One such case occurs in the UOBE algorithm in which the weights are time varying by virtue of a simple scaling procedure. The weights used at time $n$ are given by

$$\lambda_n(t) = \frac{\lambda_{n-1}(t)}{\zeta(n-1)} \quad \text{for } t \le n - 1, \qquad (5)$$

and $\lambda_n(n)$, where $\zeta(\cdot)$ is a positive scaling sequence. When $\zeta(t)$ is independent of $\lambda_j(\cdot)$ for all $t, j \ge n$, then we shall call these simply scaled weights. The method for integrating scaled weights into
WRLS is given inherently in [4] and [6], and explicitly in [5] and [9]. While the weights are directly related to the size, orientation, and location of the ellipsoid in the parameter space, this scaling procedure effectively restricts to one (viz. $\lambda_n(n)$) the number of free parameters available to control the bounding ellipsoid at time $n$. The central objective of the UOBE algorithm is to employ the weights in the context of WRLS estimation to sequentially minimize the ellipsoid size in some sense. A significant benefit is that often no weight exists which can minimize the ellipsoid, indicating that the incoming data set is uninformative in the SM sense.

All bounding ellipsoid algorithms, both adaptive and nonadaptive, adhere to the following steps. Consequently, we call this set of operations the Unified Optimal Bounding Ellipsoid (UOBE) algorithm: At time $n$,

1. In conjunction with the incoming data set $(y(n), x(n))$, find the weight, say $\lambda_n^*(n)$, which is optimal in some sense (see below);

2. Discard the data set if $\lambda_n^*(n) < 0$.

3. Update $C(n)$ and $\Theta(n)$ using some version of WRLS (e.g. see [4]).

4. Update $\kappa(n)$ using (4) or one of the recursions in [5].

Three fundamental variations on the UOBE method have been reported in the literature. The most recent, due to Dasgupta and Huang ("D-H OBE") [2], is unlike all the others in certain respects. From the present point of view, one of the key differences is that the weight pattern follows (5), but the weights are not simply scaled according to the definition above. These differences, on one hand, allow for a proof of convergence of the ellipsoid in a certain sense and make the analysis in this paper seemingly unnecessary. On the other hand, the optimization criterion used is controversial and somewhat difficult to interpret. Space does not permit elaboration upon D-H OBE, and no precise connection between this method and more "conventional" OBE algorithms exists in the literature. Hence, the analysis in this paper is not apparently related to D-H OBE. However, interesting connections do exist and these will be the subject of a forthcoming paper [8].

"Conventional" OBE algorithms operate on the optimization principle of (prospectively) minimizing some set measure of $\Omega(n)$, say $\mu\{\Omega(n)\}$. For the SISO case, Fogel and Huang [6] suggest two set measures. The first is the determinant of the inverse ellipsoid matrix,

$$\mu_v\{\Omega(n)\} \stackrel{\mathrm{def}}{=} \det\{\kappa(n) C^{-1}(n)\} \qquad (6)$$

and the second is the trace,

$$\mu_t\{\Omega(n)\} \stackrel{\mathrm{def}}{=} \mathrm{tr}\{\kappa(n) C^{-1}(n)\}. \qquad (7)$$

(We shall henceforth write $\mu_v(n)$ and $\mu_t(n)$ for simplicity.) In the MISO case in which $\Omega(n)$ is clearly interpretable as an ellipsoid, $\mu_v(n)$ is proportional to the square of the volume of the ellipsoid, while $\mu_t(n)$ is proportional to the sum of squares of its semi-axes. The same two measures are meaningful in the MIMO case, since they result in the minimization of the volume or trace of the common ellipsoid shared by all the outputs (see discussion below (4)). The original OBE algorithm of Fogel and Huang ("F-H OBE") [6] follows these UOBE steps with $\zeta(n) = \kappa(n)$ for each $n$. The set-membership weighted recursive least squares algorithm (SM-WRLS) of Deller et al. (e.g. see [3],[4]) is a UOBE algorithm with $\zeta(n) = 1$ for all $n$.

The general method for finding the UOBE optimal weight for minimizing either set measure is given in [5]. These methods include results for F-H OBE and SM-WRLS as special cases, but optimization strategies are also given, of course, in the original papers. In the volume case, it is found that the optimal weight is given by the unique positive root of a quadratic equation in $\lambda_n(n)$, say $F_v(\lambda_n(n))$, whose coefficients are expressed in terms of known quantities at time $n - 1$. The optimal trace weight, if it exists, is the unique positive root of a cubic polynomial, say $F_t(\lambda_n(n))$. The critical feature to keep in mind is the infrequent updating of UOBE, which leads to interesting performance properties and computational efficiencies.

2. Convergence Issues

Asymptotic Estimates. One of the interesting and practical benefits of having interpreted the UOBE algorithm as a WRLS algorithm with a bounded error "overlay" is the immediate consequence for convergence of the estimator. It is well-known that if the sequence $\varepsilon_*(n)$ is wide-sense stationary, second moment ergodic almost surely (a.s.), white noise, then the WRLS estimator $\Theta(n)$ will converge asymptotically to $\Theta_*$ a.s. (e.g. [7]). In the present case, we need only to add the qualifier that the UOBE algorithm not cease to accept data in order to lay claim to this useful result.

Likewise, we may even assert a.s. convergence of the WRLS estimate, albeit to a bias, when $\varepsilon_*(n)$ is colored and persistently exciting² (p.e.) [*]. Even in the presence of colored errors, therefore, as long as the acceptance of data does not cease, and the infrequency of updating does not interfere with the persistency of excitation, we may expect the UOBE estimate (ellipsoid center) to converge.

² Please read the abbreviation "p.e." as "persistently exciting" or "persistency of excitation," as appropriate.

Convergence of the Ellipsoid. It would be interesting to have a precise understanding of the asymptotic behavior of the hyperellipsoidal feasible set, especially in the case of colored noise. Knowledge that the ellipsoid is vanishing (white noise), or becoming as small as possible (colored noise), could be very useful information indeed. In the white noise case, a sufficiently small ellipsoid could serve as a reinforcing indicator of convergence, and offer a means of determining error bounds on the estimate in finite time. In the colored noise case, a small feasible set (known to contain the true, unbiased estimate) could be indispensable. Unfortunately, no known proof of this desirable result for any case of UOBE with simply scaled weights exists. The remainder of this paper indicates recent progress made toward the understanding of the convergence properties of the ellipsoid, in the presence of both white and colored noise disturbances.

We first present an important contribution toward the understanding of the asymptotic behavior of the ellipsoid:

Proposition 1 Consider the UOBE algorithm with simple scale factors as in (5). If an optimal weight exists at time $n$, then its use will certainly diminish the set measure,

$$\mu(n) < \mu(n-1) \qquad (8)$$

where $\mu$ is either $\mu_v$ or $\mu_t$. Therefore, in general $\mu(n) \le \mu(n-1)$. Further, for the trace measure, $\mu_t\{\Omega(n)\} = 0$ iff $\Omega(n) = \{\Theta(n)\} = \{\Theta_*\}$.

Proof: We prove the result for $\mu_v$. The proof for $\mu_t$ is similar. The last line of the proposition will be verified in the future discussion.

For simplicity, we write $\lambda_n(n)$ as $\lambda$. Also, the functional dependence of $\mu_v(n)$ upon $\lambda$ for a fixed $n$ is the central issue, so we write $\mu_v(\lambda)$. It has been shown that³ [5]

f  AG.(n)1  r"'*(n) 

u.(\)  =  r’"(n)(  I  -  (9) 

L  h(n)  j  h{n) 

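where $h(n) \stackrel{\mathrm{def}}{=} 1 + \lambda G_s(n)$ and $r(n) \stackrel{\mathrm{def}}{=} \kappa(n)/\kappa_s(n-1)$, $\kappa_s(n) \stackrel{\mathrm{def}}{=} \kappa(n)/\zeta(n)$ and $G_s(n) \stackrel{\mathrm{def}}{=} \zeta(n-1) x^H(n) C^{-1}(n-1) x(n)$. Thus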

It  is  found  that 
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where. 
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■and  /?(  V|  =  rnkhfn)- 
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For  future  reference,  also  note  that  [.s] 

■i.(n  ~  \)  =  F,.(\)  M-’l 

where $F_v(\cdot)$ is the volume quadratic solved to find the optimal weight. Consequently,

It is easy to demonstrate that $\alpha(\lambda)$ is positive, and that its derivative is bounded, for $\lambda \in [0, \infty)$. Now it can be shown that [5]

A  II  r(n.  0(n  -  1)) 
allowing us to write, using (11),

^P(\)  ^  tr(n)  i)  fIn.WIn—  11);!'^ 

-  =  trnk  —  llG.ln)  — ^ -nx - 
³ A similar result for a less general case is found in [ t]

Fig. 1: Typical plot of $\mu_v(\lambda)$ vs. $\lambda$. When a positive root of $F_v(\lambda)$ exists, it corresponds to a minimum of the volume measure.


Because  of  (12)  it  is  clear  that  R(  \*)  =  0.  Reference  to  the 
definition  of  R{\)  in  (11).  therefore  immediately  sh  ws  that 
^  ‘^-’'^nsequentiy  >  0.  It  fol¬ 
lows  immediately  that  i  >  0  so  that  \*  corre- 

J  V  =  i  • 

sponcU  to  a  minjmum  of  uv(M-  Further,  since  = 

1  (see  ( 14)),  and  hlnl^.g  =  1,  we  have  from  (9)  that  u-  lO)  = 
I .  and  also  t hat  Cj (0 )  =  1 .  Therefore ,  from  (  1 0 1  and  (til. 


=  H(0)  = 
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to 
where $a_0$ is the zero order coefficient of the quadratic, $F_v(\lambda) = a_2\lambda^2 + a_1\lambda + a_0$. It has been shown [5] that no positive roots of $F_v(\lambda)$ exist if $a_0 > 0$, and exactly one exists if $a_0 < 0$. It follows that the derivative in (16) is negative; hence, $\mu_v(\lambda^*) < 1$ and the proposition is proven (see Fig. 1). □


In spite of this encouraging result, one of the drawbacks of the volume approach is that the set measure $\mu_v$ is not a proper "metric" in the parameter space. By this we mean the following: Suppose we propose the distance measure $d$ such that at time $n$, $d(\theta(n), \theta_*) = \mu_v(n)$. We immediately find that $d$ fails to be a proper metric, since $d(\theta(n), \theta_*) = 0$ does not imply that $\theta(n) = \theta_*$. This unfortunate situation arises because the ellipsoid may potentially degenerate and reside in a subspace of $\mathbb{R}^m$, thereby achieving zero volume without being reduced to a point. This will likely only occur if p.e. is not achieved, as we detail below, and is therefore a more important problem with colored disturbances. This potential anomaly provides motivation to consider the use of the trace measure, for which a degenerate ellipsoid will not produce a zero set measure.

One of the drawbacks of the UOBE approach is that the hyperellipsoidal bounding sets are sometimes quite "loose" supersets of the exact feasibility sets (polytopes) (e.g. [1], [10]), particularly in "finite" time. However, many simulation studies in the literature (white noise case) have shown the volume of the ellipsoids to become quite small in the long term. Further, as we and other researchers have demonstrated, the empirical convergence and tracking properties of the UOBE estimator are favorable in spite of the few data used. This is an indication that the presence of the ellipsoid, and the optimization procedure centered on it, are quite useful for signal processing, regardless of our present inability to completely understand its behavior in theory. The results presented above offer further support for "good behavior" of this class of algorithms by indicating that the ellipsoid measures will converge to some unspecified size in some unspecified manner. This result has not been clearly understood, and its finding offers some hope that a proof of convergence for the UOBE algorithms may be found in the white noise case.

Colored Noise. Whether or not the UOBE ellipsoid can ultimately be proven to converge to a point with white noise disturbances, such a result would cause a contradiction in the colored noise case. This is so because if $\lim_{n\to\infty}\Omega(n)$ is a single point, then this implies that $\lim_{n\to\infty}\theta(n) = \theta_*$, in violation of the basic principles of least square estimation (the estimate must be biased). We therefore conclude the following:

Proposition 2 With colored noise disturbances, $\lim_{n\to\infty}\Omega(n)$ is a nontrivial set.

Empirical evidence leads to the following conjecture:

Conjecture 1 With a p.e. input, $\theta_*$ is on the boundary of the limiting set.

Of course Proposition 2 is not surprising for a non-p.e. excitation, for which the algorithm will cease to accept data in finite time due to lack of innovation. However, that the ellipsoid should remain nontrivial for p.e. inputs is not as apparent, since each time a data set is accepted, the set measure $\mu(n)$ must be diminished. This brings us to another interesting issue centered on the set measure used in the optimization.

Persistency of Excitation. In the following, we focus on the volume and trace measures, but the discussion might be generalizable in certain ways to broader classes of measures. The lengths of the ellipsoid axes at time $n$ are inversely proportional to the square roots of the eigenvalues, say $e_i$, $i = 1, \ldots, m$, of the matrix $C(n)/\kappa(n)$. Accordingly, convergence of the ellipsoid to a single point requires that $e_i \to \infty$ for all $i$. $\lim_{n\to\infty}\Omega(n)$ remains nontrivial iff one or more of the $e_i$ remains finite. This implies that, in the presence of colored disturbances, $\lim_{n\to\infty}\mu_t(n)$ must be positive, since one or more finite eigenvalues will make this so. On the other hand, $\mu_v$ becomes zero much more readily, because a single infinite eigenvalue is sufficient to cause $\lim_{n\to\infty}\mu_v(n) = 0$. That is, the ellipsoid need only "collapse in one dimension" to assure zero volume. In this sense, $\mu_v$ is a "weaker" set measure than $\mu_t$.
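
As a small numerical illustration of this distinction (a sketch, not part of the original analysis), the two measures can be evaluated directly from the eigenvalues $e_i$ of $C(n)/\kappa(n)$. The function names are hypothetical, and the normalization is an assumption: the volume measure is taken as the determinant of the inverse ellipsoid matrix, $\prod_i 1/e_i$, and the trace measure as its trace, $\sum_i 1/e_i$.

```python
import math

def volume_measure(eigs):
    """Determinant of the inverse ellipsoid matrix: product of 1/e_i (assumed form)."""
    return math.prod(1.0 / e for e in eigs)

def trace_measure(eigs):
    """Trace of the inverse ellipsoid matrix: sum of 1/e_i (assumed form)."""
    return sum(1.0 / e for e in eigs)

# One eigenvalue grows without bound (that axis collapses) while the
# other stays finite, so the limiting set is still nontrivial.
for big in (1e2, 1e4, 1e8):
    eigs = [big, 0.5]
    print(big, volume_measure(eigs), trace_measure(eigs))
```

In this sketch the volume measure tends to zero as the first eigenvalue grows, while the trace measure stays near 2 because of the single finite eigenvalue.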

The behavior of $\Omega(n)$ is not well enough understood to make definitive conclusions about the conditions under which the limiting ellipsoid may remain nontrivial in fewer than $m$ dimensions. This has implications for both white and colored noise. For white noise, this means we do not know when (if ever) the ellipsoid will collapse to a point. For colored noise, it simply means we do not know when the ellipsoid will collapse into a subspace of the parameter space (since it must remain nontrivial). However, use of the volume measure permits an intriguing situation to arise precisely because of its "weakness." We believe that this situation points to the general condition under which the collapse may occur.

Fig. 2: "Asymptotic" ellipsoids resulting from first example. In each case, * denotes the true parameters and x the estimate (ellipsoid center). (a) $\gamma = 2$. (b) $\gamma = 1.25$.

The asymptotic volume measure may be zero even if the ellipsoid is of infinite extent in one or more dimensions. That is, the finite eigenvalue(s) which imply a nontrivial limiting set may be zero. (Clearly, this cannot be the case with the trace measure, since it would imply that $\lim_{n\to\infty}\mu_t(n) = \infty$.) Under what conditions might this "degenerate" volume situation occur? It is tempting to surmise that a p.e. input would be necessary to drive the volume ultimately to zero. However, precisely the opposite is true. A singular $\lim_{n\to\infty} C(n)/\kappa(n)$ occurs iff $\lim_{n\to\infty} C(n)$ is singular. In turn, this is indicative of a non-p.e. input. In this case the ellipsoid expands without bound in the null space of $\lim_{n\to\infty} C(n)/\kappa(n)$, while it must collapse in (at least one dimension of) the range space in order to prevent the volume from also diverging. If the ellipsoid converges in every dimension of the range space, then the feasible set exists entirely within the null space of $\lim_{n\to\infty} C(n)/\kappa(n)$, except for the intersection (point, line, plane, hyperplane) with the range space. This analysis leads to the following conjecture:

Conjecture 2 "Degeneracy" of the limiting ellipsoid (collapse of $\lim_{n\to\infty}\Omega(n)$ into a subspace) will occur in the volume case iff the input is non-p.e.

Examples. In order to illustrate these ideas, we present two simple examples. In both examples an AR(2) model of the form $y(n) = a_{1*}y(n-1) + a_{2*}y(n-2) + \varepsilon_*(n)$ is used. In the first example the parameters are $a_{1*} = 0.6$, $a_{2*} = 0.1$, and $\varepsilon_*(\cdot)$ is a realization of the stochastic process $\sqrt{2}\cos[(\pi n/16) + \phi]$ with $\phi$ a uniformly distributed random phase. This noise is p.e. of order two. Two identifications were performed on this system using volume optimization. In the first, $\gamma(n)$ is ("properly") chosen to be the constant $\gamma = 2 = \varepsilon^2_{*,\max}(\cdot)$, while in the second experiment $\gamma = 1.25$, in slight violation of the proper bound. The "asymptotic" ($n = 7000$) ellipsoids are shown in Figs. 2(a) and 2(b), respectively. In the first case 128/7000 (1.8%) of the data were selected by the optimization procedure, while in the second, 101/7000 (1.4%) were used. In both experiments the parameter estimates are identical to six decimal places: $a_1(7000) = 1.961508$, $a_2(7000) = -0.999855$. Both outcomes adhere to Proposition 2 in the production of a nontrivial limiting set with a biased estimate. Some support is seen for Conjecture 1 in the proximity of the true parameters to the ellipsoid boundary.


Interestingly, the second experiment produces a more desirable outcome in this regard, and does so using fewer data, in spite of the bound violation. UOBE algorithms are remarkably robust to, and indeed sometimes benefit from, such violations.

In the second example, $a_{1*} = 1.6$, $a_{2*} = -0.65$, and the system is excited by a constant $\varepsilon_*(\cdot) = 0.322$ selected by choosing a value from a random realization, at a random time, in the cosine process above. Accordingly, $\gamma(\cdot)$ is taken as the constant $\gamma = 2 = \varepsilon^2_{*,\max}(\cdot)$. This noise is p.e. of order one, and is therefore not sufficient to uniquely identify the system. In this case 30/7000 (0.43%) of the data were used. The ellipsoid in Fig. 3 is the resulting set at time $n = 7000$. A collapsing dimension is apparent as the covariance matrix becomes singular and the feasibility set begins to occupy the null space of the ellipsoid matrix.

Fig. 3: "Asymptotic" ellipsoid resulting from second example.
We have just received the first announcement for the 9th IFAC/IFORS
Abstract

This paper is concerned with the set membership weighted recursive least squares (SM-WRLS) algorithm which can be used for estimating the parameters of linear system or signal models in which the error sequence is pointwise "energy bounded." This algorithm works with bounding hyperellipsoidal regions to describe the solution sets. A new strategy is developed which can be applied to virtually any version of the SM-WRLS algorithm to improve the computational efficiency. A significant reduction in computational complexity can be achieved by employing a "suboptimal" test for information content in an incoming equation. The proposed check is argued to be a useful determiner of the ability of incoming data to shrink the ellipsoid. The performance
1 Introduction

The set membership weighted recursive least squares (SM-WRLS) algorithm [1, 2, 3] is an efficient technique which can be used for estimating the parameters of linear system or signal models under a priori information which constrains the solutions to certain sets. When data do not help refine these membership sets, the effort of updating the parameter estimates at those points can be avoided. The SM-WRLS algorithm is concerned with the case in which the error sequence, say $v(n)$, is pointwise "energy bounded,"

$$\gamma(n)v^2(n) < 1 \tag{1}$$

where the sequence $\gamma(n)$ is known or can be estimated from the data. Constraints of form (1), in conjunction with the model and data, imply pointwise "hyperstrip" regions of possible parameter sets in the parameter space which, when intersected over a given time range, usually form convex polytopes of permissible solutions for the "true" parameters. While exact descriptions of these polytopes are possible (e.g., see [4]), algorithms of much lower complexity have been developed which work with a bounding hyperellipsoid, a tight superset of the polytope [1, 2, 3, 5, 6, 7]. The SM-WRLS algorithm is one such algorithm which is formulated such that it is exactly the familiar weighted recursive least squares solution [8, 9] with the SM considerations handled through a special weighting strategy. A tutorial on the SM-WRLS algorithm, and on general SM identification, is found in [2].
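
To make the "hyperstrip" picture concrete, the following minimal sketch tests whether a candidate parameter vector is consistent with constraint (1) for one equation, and then for a whole data record (the intersection of hyperstrips, i.e. the polytope). It assumes a linear-in-parameters model in which the error is the output minus the inner product of the parameter vector and a regressor; the function names and the sample numbers are hypothetical.

```python
def in_hyperstrip(theta, x, y, gamma):
    """True if theta satisfies gamma * (y - theta^T x)^2 < 1 for one equation."""
    residual = y - sum(t * xi for t, xi in zip(theta, x))
    return gamma * residual ** 2 < 1.0

def in_polytope(theta, data):
    """Intersect the hyperstrips of every (x, y, gamma) triple in data."""
    return all(in_hyperstrip(theta, x, y, g) for x, y, g in data)

# Two illustrative equations; gamma = 4 means |error| must be below 0.5.
data = [([1.0, 0.0], 0.5, 4.0), ([0.0, 1.0], -0.2, 4.0)]
print(in_polytope([0.6, -0.1], data))   # inside both strips
print(in_polytope([1.2, -0.1], data))   # violates the first strip
```

The bounding-ellipsoid algorithms discussed here avoid storing and intersecting the strips explicitly; the sketch only shows what set the ellipsoid is bounding.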

One of the advantages of the SM-WRLS formulation is that it immediately admits solution by contemporary systolic array processors for speed advantages [1]. The systolic array implementations of this algorithm are presented in [10, 11]. In this paper, however, we will develop a "suboptimal" strategy which can be applied to virtually any version, adaptive or "non-adaptive," of the SM-WRLS algorithm to improve the computational efficiency. The theoretical development of this strategy is presented in Section 2. In Section 3, the performance of this strategy is compared to that of SM-WRLS using simulation studies. A detailed analysis of the computational complexity issues is found in Section 4.

2  Theoretical  Development 

2.1  The  SM-WRLS  Algorithm 

We consider the estimation of the parameters of a general ARMAX($p$, $q$) [9] model of the form

$$y(n) = \sum_{i=1}^{p} a_i y(n-i) + \sum_{j=0}^{q} b_j w(n-j) + v(n) \triangleq \theta_0^T x(n) + v(n) \tag{2}$$

in which $y(n)$ is a scalar output; $w(n)$ is a measurable, uncorrelated input; and $v(n)$ is an uncorrelated process, known to be bounded as in (1), which is independent of $w(n)$. For convenience, we also employ the vector notations

$$x^T(n) = [\,y(n-1) \cdots y(n-p)\;\; w(n)\; w(n-1) \cdots w(n-q)\,] \tag{3}$$

and

$$\theta_0^T = [\,a_1\; a_2 \cdots a_p\;\; b_0\; b_1 \cdots b_q\,] . \tag{4}$$

$\theta_0$ represents the vector of parameters to be estimated. We define the integer $m = p + q + 1$, noting that $m$ should be reduced to simply $m = p$ for the pure AR case.

Let us define $\hat\theta(N)$ to be the conventional weighted LS estimate of $\theta_0$ using the data on the range $n = 1, 2, \ldots, N$, with squared error minimization weights $\lambda(n)$. Also denote the (weighted) covariance matrix for the data by $C(N)$. As a consequence of the bounds on the sequence $v(n)$, at time $N$ there is a computable hyperellipsoidal domain in the parameter space which certainly contains $\theta_0$ and which is centered on the LS estimate. This set is given by [3]

$$\Omega(N) = \left\{ \theta \;\middle|\; (\theta - \hat\theta(N))^T \frac{C(N)}{\kappa(N)} (\theta - \hat\theta(N)) < 1 \right\}, \tag{5}$$

$\theta \in \mathbb{R}^m$, where

$$\kappa(N) = \hat\theta^T(N)C(N)\hat\theta(N) + \sum_{n=1}^{N} \frac{\lambda(n)}{\gamma(n)}\left(1 - \gamma(n)y^2(n)\right) \tag{6}$$

$$= \|d_1(N)\|^2 + \tilde\kappa(N) \tag{7}$$

in which $\|\cdot\|$ denotes the Euclidean norm,

$$d_1(N) = T(N)\hat\theta(N) \tag{8}$$

with $T(N)$ the upper triangular Cholesky factor [12] of $C(N)$ (more on this below), and $\tilde\kappa(N)$ denotes the sum on the right side of (6). It is useful to note that

$$\tilde\kappa(n) = \tilde\kappa(n-1) + \frac{\lambda(n)}{\gamma(n)}\left(1 - \gamma(n)y^2(n)\right), \quad \tilde\kappa(0) = 0 . \tag{9}$$


Very importantly, the "size" of this domain is a function of only one unknown at time $n$, $\lambda(n)$, the error minimization weight at time $n$. The "SM strategy" of updating the parameter vector at time $n$ involves the computation of the $\lambda(n)$ which minimizes the size of $\Omega(n)$. The volume of the ellipsoid, $\Omega(n)$, is proportional to the quantity

$$\det B(n) = \det \kappa(n)C^{-1}(n) = \frac{\kappa^m(n)}{\det C(n)} . \tag{10}$$

A reasonable strategy is to find an optimal weight, $\lambda^*(n)$, at each step which minimizes the "volume ratio" of the ellipsoids at $n$ and $n - 1$:

$$V(\lambda(n)) = \frac{\det B(n)}{\det B(n-1)} . \tag{11}$$


This weight is taken to be the most positive root of the quadratic equation [3]

$$F(\lambda) = \alpha_2\lambda^2 + \alpha_1\lambda + \alpha_0 = 0 \tag{12}$$

where

$$\alpha_2 = (m-1)G^2(n)$$

$$\alpha_1 = \{2m - 1 + \gamma(n)\varepsilon^2_{n-1}(n) - \kappa(n-1)\gamma(n)G(n)\}G(n)$$

$$\alpha_0 = m[1 - \gamma(n)\varepsilon^2_{n-1}(n)] - \kappa(n-1)\gamma(n)G(n)$$

and where

$$G(n) = x^T(n)C^{-1}(n-1)x(n) \tag{13}$$

and $\varepsilon_{n-1}(n)$ is the residual at time $n$ based on the parameter estimate at $n - 1$,

$$\varepsilon_{n-1}(n) = y(n) - \hat\theta^T(n-1)x(n) . \tag{14}$$

One important consequence of this approach is that often no $\lambda(n)$ exists which will further shrink $\Omega(n)$. Generally, this means that the equation$^1$ $(y(n), x(n))$ at $n$ has "no information" which has not already been incorporated into the estimate. In this case the equation is "rejected" ($\lambda(n)$ effectively set to zero), saving the computational expense otherwise necessary to incorporate it.
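
The weight selection just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the systolic formulation of [1]: the scalars $G(n)$, $\varepsilon_{n-1}(n)$, $\gamma(n)$ and $\kappa(n-1)$ are assumed to be already available, the function name `optimal_weight` is hypothetical, and "rejection" is represented by returning a weight of zero.

```python
import math

def optimal_weight(m, G, gamma, eps, kappa_prev):
    """Most positive root of quadratic (12), or 0.0 (rejection) if none is positive."""
    ge2 = gamma * eps ** 2
    a2 = (m - 1) * G ** 2
    a1 = (2 * m - 1 + ge2 - kappa_prev * gamma * G) * G
    a0 = m * (1.0 - ge2) - kappa_prev * gamma * G
    if a2 == 0.0:
        # m == 1 (or G == 0): (12) degenerates to a linear equation.
        root = -a0 / a1 if a1 != 0.0 else 0.0
    else:
        disc = a1 * a1 - 4.0 * a2 * a0
        if disc < 0.0:
            return 0.0
        # a2 > 0 here, so the "+" branch is the larger root.
        root = (-a1 + math.sqrt(disc)) / (2.0 * a2)
    return root if root > 0.0 else 0.0

# Large residual relative to the bound: the equation is accepted.
print(optimal_weight(m=2, G=0.5, gamma=1.0, eps=1.5, kappa_prev=1.0))
# Small residual: no positive root, the equation is rejected.
print(optimal_weight(m=2, G=0.5, gamma=1.0, eps=0.1, kappa_prev=1.0))
```

Note that all of $\alpha_2$, $\alpha_1$, $\alpha_0$ must be formed before the sign of the root is known, which is the cost that a cheaper pre-test would aim to avoid.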

In a recent paper [1], SM-WRLS was formulated into a more contemporary WRLS algorithm which is amenable to a systolic architecture implementation. The algorithm is given in the Appendix of this paper. In Steps 4 and 5 of the algorithm, the LS problem is solved using a sequential "QR" decomposition using Givens rotations (GR's), a method which is well-understood and becoming widely used for this purpose [13, 14, 15]. (It is important to note the meaning of the matrix $T(n)$ in this process. We first encountered $T(n)$ in (8), where it was defined as the upper-triangular Cholesky factor matrix of $C(n)$ at each step, i.e., $C(n) = T^T(n)T(n)$ (see Appendix).) The more novel part of the algorithm, in Steps 1, 2, 3 and 6, is concerned with the computation of the optimal weights. Here the method had to be designed to avoid the costly inversion of the matrix $C(n)$, nominally necessary to compute $G(n)$. The quantity $\kappa(n)$ is also efficiently computed in this context. The reader is referred to [1] for details.

2.2  Adaptive  SM-WRLS  Algorithm 

In this section, we present an adaptive SM-WRLS algorithm with a very flexible mechanism by which it can "forget" the influence of past data.

The adaptive algorithm presented here uses "back rotation" in order to partially or completely "forget" past information, enabling it to track (potentially fast) time varying signals. Back rotation [13] is a Givens rotation-based technique that removes (or rotates out) a previously included equation from the system. In this paper we modify the back rotation so that a previous equation can be partially removed. This will permit a broader class of adaptive strategies. In SM terms, back rotation causes the ellipsoidal membership set to expand due to the removal of information. This expansion entices the algorithm to incorporate present data. The back rotation technique requires that all the weights with the corresponding equations (for weights other than zero) be stored for later use.

$^1$Since the LS process represents an effort to fit the $m$ model parameters to $N$ equations of the form $y(n) \approx \theta^T x(n)$, we refer to the pair $(y(n), x(n))$ as an "equation" throughout this paper.

In the Appendix, we see that at each step in the SM-WRLS algorithm, the upper triangular system of simultaneous equations $T(n)\hat\theta(n) = d_1(n)$ is solved (when data are accepted) to obtain the optimal estimate [2, 3, 10]. Suppose in approaching time $n$ that the past equation to be (partially) removed is at time $\tau$. Rotating this equation out of the system is accomplished by re-introducing it as though it were a new equation. A weight of $\mu\lambda(\tau)$, where $\mu$ is the fraction of the equation to be removed from the system, is used, and some sign changes in the rotation equations are necessary [13]. Let us refer to the system of equations with $\tau$ removed as the "downdated" system at time $n - 1$, and label the related quantities with subscript $d$, i.e.,

$$T_d(n-1)\hat\theta_d(n-1) = d_{1,d}(n-1) . \tag{15}$$


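The downdated ellipsoid matrix is $C_d(n-1)/\kappa_d(n-1)$ where

$$C_d(n-1) = T_d^T(n-1)T_d(n-1) , \tag{16}$$

$$\kappa_d(n-1) = \|d_{1,d}(n-1)\|^2 + \tilde\kappa_d(n-1) \tag{17}$$

with

$$\tilde\kappa_d(n-1) = \tilde\kappa(n-1) - \frac{\mu\lambda(\tau)}{\gamma(\tau)}\left(1 - \gamma(\tau)y^2(\tau)\right) . \tag{18}$$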

Equations (17) and (18) follow immediately from the definition of $\kappa$ found in (6). These relations can be used repeatedly regardless of the number of equations (partially or completely) removed prior to time $n$. If more than one equation is removed prior to $n$, $\tilde\kappa(n-1)$ in the right-hand side of (18) is replaced by $\tilde\kappa_d(n-1)$ for all downdates after the first one. Following all necessary downdating just prior to time $n$, the algorithm uses the downdated system to compute the quantities $G_d(n)$ and $\varepsilon_{n-1,d}(n)$ which are necessary to compute the optimal weight for the equation at $n$.

To compute a downdated SM-WRLS estimate, therefore, it is only necessary to downdate the matrix $T(n-1)$ and the vector $d_1(n-1)$ and to solve for $\hat\theta_d(n-1)$ prior to Step 1, then replace all relevant quantities in Step 1 by their downdated versions. $\kappa(n-1)$ and $\tilde\kappa(n-1)$ are downdated according to (17) and (18). Then $\lambda^*(n)$ is found in Step 2 using (12) with downdated quantities. Note that downdating is unnecessary if the equation $\tau$ was rejected by SM-WRLS. In this case $T_d(n-1) = T(n-1)$ and $\hat\theta_d(n-1) = \hat\theta(n-1)$. Conversely, when the "new" equation at $n$ is rejected, then $T(n) = T_d(n-1)$ and $\hat\theta(n) = \hat\theta_d(n-1)$.
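
The scalar part of the downdating bookkeeping can be sketched in a few lines. This is only an illustration of recursion (9) and relation (18), under the assumption that the removed fraction $\mu$ scales the stored weight of the equation at time $\tau$; it does not perform the back rotation of $T$ and $d_1$, and the function names are hypothetical.

```python
def kappa_tilde_update(kappa_tilde, lam, gamma, y):
    """Recursion (9): add the contribution of an equation accepted with weight lam."""
    return kappa_tilde + (lam / gamma) * (1.0 - gamma * y * y)

def kappa_tilde_downdate(kappa_tilde, mu, lam_tau, gamma_tau, y_tau):
    """Relation (18): remove a fraction mu of the equation stored at time tau."""
    return kappa_tilde - (mu * lam_tau / gamma_tau) * (1.0 - gamma_tau * y_tau * y_tau)

k0 = 0.3
k1 = kappa_tilde_update(k0, lam=0.8, gamma=2.0, y=0.4)
# Removing the whole equation (mu = 1) undoes the update.
print(kappa_tilde_downdate(k1, 1.0, 0.8, 2.0, 0.4))
# Removing half of it (mu = 0.5) lands midway between k0 and k1.
print(kappa_tilde_downdate(k1, 0.5, 0.8, 2.0, 0.4))
```

Because (18) needs the weight, bound and output of the removed equation, the sketch also shows why the accepted equations and their weights must be stored.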

A wide range of adaptation strategies is inherent in the general formulation described above. Three major subcases are identified (windowing, graceful forgetting, and selective forgetting) in [11, 16]. In each of these subcases, the objective is to expand the ellipsoidal region of possible solutions in order to track fast time variations in the signal. For illustration and comparison purposes in the simulations below, we use one of the adaptive strategies, namely, selective forgetting, in which the equations are removed from the estimate according to certain user-defined criteria in order to remove their influence on the result. The selection criterion used here is to remove the equations starting from the first accepted equation remaining in the estimate at a given time, and proceeding sequentially until some other condition is satisfied. The determination of when to apply the forgetting procedure and when to stop removing equations at a given time is discussed in [16].

2.3  Suboptimal  Test  for  Innovation 

A significant reduction in computational complexity can be achieved by employing a "suboptimal" test for information content in an incoming equation. The proposed check is argued to be a useful determiner of the ability of incoming data to shrink the ellipsoid, but it does not rigorously determine the existence of an optimal SM weight in the sense described above. The main issue here is to avoid the computations of the quantities necessary at each step to construct and solve the quadratic (12) in cases in which the quadratic turns out only to be useful for the purpose of checking for the existence of a meaningful weight. Since most of the time these computations result in the rejection of incoming data, a more efficient test could significantly reduce the complexity of the algorithm.

The estimation error vector at time $n$ can be denoted by

$$\tilde\theta(n) = \theta_0 - \hat\theta(n) . \tag{19}$$

The following inequality results immediately from (5),

$$\tilde\theta^T(n)C(n)\tilde\theta(n) < \kappa(n) . \tag{20}$$

Using a similar inequality, Dasgupta and Huang [6] have noted that their $\kappa(n)$-like quantity provides a bound on the error vector sequence and have suggested minimizing this quantity with respect to $\lambda(n)$ in an effort to decrease computational complexity. However, this minimization does not, in general, imply an improvement in the estimate with respect to previous times, since both sides of the inequality (20) are dependent upon $\lambda(n)$. Further, the nonexistence of a minimum of $\kappa(n)$ with respect to $\lambda(n)$ is not very informative in this sense. However, further arguments are presented here to provide support for this process in the SM-WRLS context.

Consider the usual volume quantity to be minimized at time $n$, defined in (10). Let us temporarily write the two key quantities there as functions of $\lambda(n)$: $C(n, \lambda(n))$ and $\kappa(n, \lambda(n))$. It is assumed that enough equations have been included in the covariance matrix at time $n - 1$ so that its elements are large with respect to the data in the incoming equation. Now the quantity $\det C(n, \lambda(n))$ is readily shown to be monotonically increasing with respect to $\lambda(n)$ on $\lambda(n) \in [0, \infty)$ [16], with $C(n, 0) = C(n-1, \lambda^*(n-1))$, where $\lambda^*(n-1)$ indicates the optimal weight at time $n - 1$. Under the assumption above, $\det C(n, \lambda(n))$ will not increase significantly over reasonably small values of $\lambda(n)$. The attempt to maximize $\det C(n, \lambda(n))$ in (10) causes a tendency to increase $\lambda(n)$ in the usual optimization process. However, the attempt to minimize $\kappa(n, \lambda(n))$ generally causes a tendency toward small values of $\lambda(n)$, unless a minimum of $\kappa(n, \lambda(n))$ occurs at a "large" value of $\lambda(n)$. To pursue this idea and further points of the argument, key results about $\kappa(n, \lambda(n))$ are noted in the following.

Theorem 1 $\kappa(n, \lambda(n))$ has the following properties:

• On the domain $\lambda(n) \in [0, \infty)$, $\kappa(n, \lambda(n))$ is either monotonically increasing or it has a single minimum.

• $\kappa(n, \lambda(n))$ has a minimum on $\lambda(n) \in [0, \infty)$ iff

$$\varepsilon^2_{n-1}(n) > \gamma^{-1}(n) . \tag{21}$$

Lemma 1 [5]. Let $\lambda^*(n-1)$ denote the optimal weight in the sense of (11) (which might be zero) at time $n - 1$. Then

$$\kappa(n, \lambda(n)) = \kappa(n-1, \lambda^*(n-1)) + \frac{\lambda(n)}{\gamma(n)} - \frac{\lambda(n)\varepsilon^2_{n-1}(n)}{1 + \lambda(n)G(n)} . \tag{22}$$

Proof of Theorem 1 The minimum of $\kappa(n, \lambda(n))$ with respect to $\lambda(n)$ can be found by differentiating (22) and setting the result equal to 0,

$$G^2(n)\lambda^2(n) + 2G(n)\lambda(n) + [1 - \gamma(n)\varepsilon^2_{n-1}(n)] = 0 . \tag{23}$$

This is a concave upward quadratic function with its minimum at

$$\lambda'(n) = -G^{-1}(n) < 0 . \tag{24}$$

Two real roots of (23) always exist,

$$\lambda(n) = \frac{-1 \pm \sqrt{\gamma(n)}\,|\varepsilon_{n-1}(n)|}{G(n)} , \tag{25}$$

the smaller corresponding to a maximum of $\kappa(n, \lambda(n))$, the larger to a minimum. Only the larger root can be positive, since the lower root is bound to be less than $\lambda'(n)$. Therefore, it is only possible for $\kappa(n, \lambda(n))$ to exhibit a minimum or to be increasing on positive $\lambda(n)$. It is easy to use (25) to verify that the larger root is positive iff condition (21) is met. □

With these results, it can be argued that: If $\det C(n, \lambda(n))$ is increasing, but not changing significantly over reasonably small values of $\lambda(n)$, then it is sufficient to seek $\lambda(n)$ which minimizes $\kappa(n, \lambda(n))$. If $\kappa(n, \lambda(n))$ is monotonically increasing on $\lambda(n) > 0$, this value is $\lambda(n) = 0$, which corresponds to rejection of the equation at time $n$. It suffices, therefore, to have a test for a minimum of $\kappa(n, \lambda(n))$ on positive $\lambda(n)$. As noted above, a simple test is embodied in condition (21). If this test is met, it is then cost effective to proceed with the standard optimization process centered on (12). Otherwise, the explicit construction and solution of (12) can be avoided.

It is to be noted that even if (21) is met, it is possible that the optimization procedure will still reject the datum. Perhaps more importantly, it is also possible for (21) to reject data which would have been accepted by the usual process. These ideas will be explored in the simulation studies below.

Finally, note that when the simplified test (21) accepts the new equation, there are tools to compute the weight which is "optimal" in the sense of minimizing $\kappa(n, \lambda(n))$. In particular, this would be the larger of the roots in (25). However, it clearly makes more sense to compute the optimal weight according to (12), since this computation is not much more expensive. The improvement in the computational complexity due to "suboptimal checking" is discussed in Section 4. It is important to note that the general adaptive formulation of Section 2.2 is amenable to the suboptimal technique described here. The performance of the suboptimal and the adaptive suboptimal techniques will be investigated in the next section.

3 Simulation Studies

In this section, we consider the estimation of the parameters of two time varying AR(2) models of the form

$$y(n) = a_1(n)y(n-1) + a_2(n)y(n-2) + v(n) . \tag{26}$$

Two sets of AR parameters were derived using linear prediction (LP) analysis of order two on utterances of the words "four" and "six" by an adult male speaker. The data were sampled at 10 kHz after 4.7 kHz lowpass filtering, and the "forgetting factor" in the LP algorithm (see [17]) was 0.996. A 7000 point sequence, $y(n)$, for each case ("four" and "six") was generated by driving the appropriate set of parameters with an uncorrelated sequence, $v(n)$, which was uniformly distributed on $[-1, 1]$. In the simulations below, we apply the conventional and suboptimal SM-WRLS algorithms to the estimation of the $a_i$ parameters.
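
The suboptimal acceptance test of Section 2.3 that these simulations exercise can be sketched as follows. This is a hedged illustration rather than the implementation used for the reported results: the function names are hypothetical, the form of the $\kappa$ recursion is the one implied by definition (6) and Lemma 1, and the minimizer is the larger stationary point obtained by differentiating that recursion (the root referred to as (25) in the text).

```python
import math

def passes_suboptimal_test(eps, gamma):
    """Condition (21): the squared a priori residual exceeds 1/gamma."""
    return eps * eps > 1.0 / gamma

def kappa_next(kappa_prev, lam, gamma, eps, G):
    """Relation (22): kappa at time n as a function of the candidate weight lam."""
    return kappa_prev + lam / gamma - lam * eps * eps / (1.0 + lam * G)

def kappa_minimizer(gamma, eps, G):
    """Larger stationary point of kappa_next in lam; positive exactly when (21) holds."""
    return (math.sqrt(gamma) * abs(eps) - 1.0) / G

gamma, G, kappa_prev = 1.0, 0.5, 1.0
for eps in (1.5, 0.5):
    ok = passes_suboptimal_test(eps, gamma)
    lam = kappa_minimizer(gamma, eps, G)
    print(eps, ok, lam, kappa_next(kappa_prev, max(lam, 0.0), gamma, eps, G))
```

Only one multiplication and one comparison are needed for the test itself, which is why it is attractive as a screen before the quadratic (12) is constructed.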

We discuss a number of simulation results. To conserve space, only the result for $a_1$ is illustrated in each case. Each figure shows two curves, one for the true parameter, the other for the estimate obtained by the algorithm under study.

Figures 1 and 2 show the simulation results of the conventional
SM-WRLS algorithm for the words four and six using only
1.86% and 2.16% of the data, respectively. Figures 3 and 4 show
the simulation results of the conventional SM-WRLS algorithm
with suboptimal data selection. In this case, only 1.19% and
Figure 1: Simulation results of the SM-WRLS algorithm for the
word four. 1.86% of the data is employed in the estimation process.

Figure 2: Simulation results of the SM-WRLS algorithm for the
word six. 2.16% of the data is employed in the estimation process.

1.53% of the data are used for the words four and six, respectively.
Compared to the conventional SM-WRLS algorithm (see
Figs. 1 and 2), the suboptimal technique uses slightly fewer data
but produces comparable estimates. It is interesting to note that
most of the equations (97.6% for the word four and 94.4% for
the word six) that are accepted by the suboptimal technique are
also accepted by the conventional SM-WRLS algorithm. It is
also interesting to note that the equations that are accepted by
the suboptimal technique but not by the conventional SM-WRLS
algorithm lie mostly in regions of fast changing dynamics.
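The screening step behind suboptimal data selection can be illustrated with a short sketch. It assumes that test (21) compares the squared prediction error with the noise bound, and that the change in $\kappa$ for a candidate weight follows the form $\lambda\gamma - \lambda\varepsilon^2/(1 + \lambda G)$; both forms and all function names are assumptions made for illustration. The stationary point computed here only shows why such a test separates useful from redundant equations; the weight actually used by the algorithm is still obtained from its own optimality criterion.

```python
import math


def kappa_increment(lam, eps_sq, gamma, G):
    """Assumed change in kappa when an equation is given weight lam >= 0."""
    return lam * gamma - lam * eps_sq / (1.0 + lam * G)


def passes_suboptimal_test(eps_sq, gamma):
    """Assumed form of the test: kappa can be decreased by some lam > 0."""
    return eps_sq > gamma


def kappa_minimizing_weight(eps_sq, gamma, G):
    """Weight at which kappa_increment is stationary, or 0.0 if the test fails."""
    if G <= 0.0 or gamma <= 0.0:
        raise ValueError("this sketch assumes G > 0 and gamma > 0")
    if not passes_suboptimal_test(eps_sq, gamma):
        return 0.0
    # Setting d/dlam [lam*gamma - lam*eps_sq/(1 + lam*G)] = 0 gives
    # (1 + lam*G)^2 = eps_sq / gamma.
    return (math.sqrt(eps_sq / gamma) - 1.0) / G
```

Because the test needs only the prediction error and the bound, it can be evaluated before any of the more expensive per-equation computations are carried out.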

Figures 5 and 6 show the simulation results of the selective
forgetting adaptive strategy. This strategy uses only 3.6% and
2.33% of the data for the words four and six, respectively. More
data than with the conventional SM-WRLS algorithm are used,
but more accurate estimates result and the time-varying parameters
are tracked more quickly and accurately. This can be easily
seen when the parameter dynamics change abruptly near the
point 2100 for the word four (see Fig. 5) and near the points
2000 and 4500 for the word six (see Fig. 6).

We have noted that the general formulation of the adaptive
SM-WRLS algorithm is amenable to the suboptimal technique.


Figure 3: Simulation results of the SM-WRLS algorithm with
suboptimal data selection for the word four. 1.19% of the data
is employed in the estimation process.

Figure 4: Simulation results of the SM-WRLS algorithm with
suboptimal data selection for the word six. 1.53% of the data is
employed in the estimation process.

The simulation results of the selective forgetting SM-WRLS technique
with suboptimal data selection are shown in Figs. 7 and 8.
This strategy uses only 1.89% and 1.86% of the data for the words
four and six, respectively. Compared to the selective forgetting
strategy (Figs. 5 and 6), the selective forgetting technique with
suboptimal data selection uses fewer data but produces comparable
estimates. On the other hand, when compared to conventional
SM-WRLS with suboptimal data selection (Figs. 3 and
4), the selective forgetting suboptimal technique uses more data
but produces better estimates.

4  Complexity  Analysis 

In order to perform a detailed analysis of the computational complexities,
we employ the following notation: if the fraction of the
data accepted by the conventional SM-WRLS algorithm is denoted
by r, the fraction of the data accepted by the SM-WRLS
algorithm with suboptimal data selection by s (s < r), and the
fraction of the data accepted by the SM-WRLS algorithm after
passing the test (21) by t (t < s), then the total computational
Figure 5: Simulation results of the selective forgetting SM-WRLS
algorithm for the word four. 3.6% of the data is employed in the
estimation process.

Figure 6: Simulation results of the selective forgetting SM-WRLS
algorithm for the word six. 2.83% of the data is employed in the
estimation process.

complexity  of  the  conventional  SM-WRLS  algorithm  is  given  by 
[16]

(m’  +  2m  4-  13)  +  r  i2m^  +  3m  +  t]  (27) 

floating point operations (flops) per equation. For the SM-WRLS
algorithm with suboptimal data selection, it is given by [16]

(m  +  1)  +  j  |m^  +  m  +  12j  +  f  !2m^  *  3m  4-  tJ  (28) 

flops  per  equation.  When  considering  a  typical  example  to  com¬ 
pare  the  complexities  of  the  two  strategies,  the  suboptimal  strat¬ 
egy  reduces  the  computational  complexity  of  the  conventional 
algorithm  by  60  -  70%,  which  is  clearly  advantageous  especially 
when  noting  that  the  simulation  results  of  the  two  strategies  are 
comparable. 

If the fractions of the data used by the adaptive SM-WRLS
algorithms are denoted by the same symbols used above, and the
fraction of the data removed from the system is denoted by u,
then the total computational complexity of the selective forgetting
SM-WRLS algorithms is given by

$$(0.5m^2 + 2.5m + 13) + r\,[2.5m^2 + 10.5m + 5] + u\,[2m^2 + 10m + 5] \tag{29}$$

flops per equation. For the suboptimal selective forgetting SM-WRLS
algorithms, it is given by

$$(m + 1) + s\,[0.5m^2 + 1.5m + 12] + t\,[2.5m^2 + 10.5m + 5] + u\,[2m^2 + 10m + 5] \tag{30}$$

flops per equation. Again, the computational complexity is reduced
by 60 - 70%.

Figure 7: Simulation results of the selective forgetting SM-WRLS
algorithm with suboptimal data selection for the word four.
1.89% of the data is employed in the estimation process.

Figure 8: Simulation results of the selective forgetting SM-WRLS
algorithm with suboptimal data selection for the word six. 1.86%
of the data is employed in the estimation process.
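The two flop counts for the selective forgetting algorithms can be compared numerically with a small helper. The sketch below assumes the coefficient values read from (29) and (30) as $0.5m^2 + 2.5m + 13$, $2.5m^2 + 10.5m + 5$, $2m^2 + 10m + 5$, and $0.5m^2 + 1.5m + 12$; the function names are hypothetical, and any fractions passed in are illustrative rather than measured.

```python
def flops_selective_forgetting(m, r, u):
    """Flops per equation for selective forgetting SM-WRLS, per the assumed (29)."""
    return ((0.5 * m**2 + 2.5 * m + 13)
            + r * (2.5 * m**2 + 10.5 * m + 5)
            + u * (2 * m**2 + 10 * m + 5))


def flops_suboptimal_selective_forgetting(m, s, t, u):
    """Flops per equation with suboptimal data selection, per the assumed (30)."""
    return ((m + 1)
            + s * (0.5 * m**2 + 1.5 * m + 12)
            + t * (2.5 * m**2 + 10.5 * m + 5)
            + u * (2 * m**2 + 10 * m + 5))


def percent_reduction(baseline, improved):
    """Relative saving of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline
```

The leading term is the cost paid for every incoming equation, so replacing the quadratic check by the linear one $(m + 1)$ dominates the saving whenever the accepted fractions are small.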

5  Conclusion 

This paper presents a suboptimal data checking strategy for the
SM-WRLS algorithm. It also shows how adaptation can be incorporated
into SM-WRLS in a very general way by introducing
a flexible mechanism by which the algorithm can forget the influence
of past data. The suboptimal technique (which can be
applied to virtually any version, adaptive or non-adaptive, of the
SM-WRLS algorithm) uses many fewer data, is a square-root
factor better in computational complexity, and produces comparable
estimates to the optimal algorithm.
Appendix

SM-WRLS Algorithm Based on QR Decomposition

Initialization: Fill an (m + 1) × (m + 1) working matrix W
with zeros. For n = 1, ..., m + 1, set $\lambda(n) = 1$. Set $\tilde{\kappa}(0) = 0$.

Recursion¹: For n = 1, ..., N,

1. (Skip if n ≤ m + 1 (see footnote).) Update $G(n)$, $\varepsilon_{n-1}(n)$
as follows.

Solve

$$T^T(n-1)\,g(n) = x(n)$$

for $g(n)$ by back-substitution. Compute $G(n) = \|g(n)\|^2$
and

$$\varepsilon_{n-1}(n) = y(n) - \theta^T(n-1)\,x(n).$$

2. (Skip if n ≤ m + 1 (see footnote).) Compute the optimal
$\lambda^*(n)$ by finding the most positive root of (12).

3. (Skip if n ≤ m + 1 (see footnote).) If $\lambda^*(n) \le 0$, set $T(n) = T(n-1)$,
$d_1(n) = d_1(n-1)$, $\tilde{\kappa}(n) = \tilde{\kappa}(n-1)$, and go to Step
7. Otherwise continue.

4. Update T(n) as follows. Replace the bottom row of W by

$$\sqrt{\lambda^*(n)}\,[\,x^T(n) \mid y(n)\,].$$

"Rotate" this new row into W using Givens rotations,
leaving the result

$$[\,T(n) \mid d_1(n)\,]$$

in the upper m rows of W. These rotations involve the
scalar computations ([13, 14]),

''^'■ni-i,*  =  -(-  ,t<T« 

for  t  =  j.  j  -V  1, .  , . ,  m  -  1  and  for  j  =  1, 2, ....  m;  where 

o  =  W„ip.  r  = 


$\delta$ is unity², and $W_{jk}$ ($W'_{jk}$) is the j,k element of W pre-
(post-) rotation.

5. (Skip if n ≤ m (see footnote).) Update $\theta(n)$ by solving

$$T(n)\,\theta(n) = d_1(n)$$

using back-substitution.

6. Update $\kappa(n)$ and $\tilde{\kappa}(n)$ according to (7) and (9). (Compute
and store only $\tilde{\kappa}(n)$ if n ≤ m.)

7. If n < N, increment n and return to Step 1.
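Steps 4 and 5 can be illustrated with a short Python sketch. It uses standard Givens rotations to fold one weighted equation into the triangular array and then back-substitutes for the estimate. The sketch keeps the incoming row in a separate list rather than in the bottom row of W, omits the "rotate out" case in which $\delta = -1$, and uses hypothetical function names, so it should be read as an illustration of the rotation idea rather than a transcription of the algorithm above.

```python
import math


def rotate_in(W, x, y, lam=1.0):
    """Rotate the weighted equation sqrt(lam) * [x | y] into W (modified in place).

    W holds [T | d1] as m rows of length m + 1, with T upper triangular.
    Returns the rotated-out row; its last entry is the leftover residual.
    """
    m = len(W)
    scale = math.sqrt(lam)
    row = [scale * v for v in x] + [scale * y]
    for j in range(m):
        rho = math.hypot(W[j][j], row[j])
        if rho == 0.0:
            continue  # nothing to annihilate in this column
        c, s = W[j][j] / rho, row[j] / rho
        for k in range(j, m + 1):
            w_jk, r_k = W[j][k], row[k]
            W[j][k] = c * w_jk + s * r_k
            row[k] = -s * w_jk + c * r_k
    return row


def back_substitute(W):
    """Solve T theta = d1 for theta, assuming T is nonsingular."""
    m = len(W)
    theta = [0.0] * m
    for j in range(m - 1, -1, -1):
        acc = W[j][m] - sum(W[j][k] * theta[k] for k in range(j + 1, m))
        theta[j] = acc / W[j][j]
    return theta


# Three consistent equations in m = 2 unknowns, all with unit weight.
W = [[0.0] * 3 for _ in range(2)]
for x, y in [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0)]:
    rotate_in(W, x, y)
theta = back_substitute(W)
```

Each accepted equation costs one sweep of m rotations, and rejected equations leave W untouched, which is why the fraction of accepted data enters the flop counts directly.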

¹Generally T(n) does not become nonsingular until n = m + 1. The first
$\theta(n)$ cannot be computed until n = m + 1 and the first optimal weight, $\lambda^*(n)$,
cannot be computed until n = m + 2 (it is convenient to let $\lambda(n) = 1$ on the
initial range), and $\kappa(n)$ is not needed until n = m + 1. However, $\tilde{\kappa}(n)$ must
be computed for all n (beginning with $\tilde{\kappa}(0) = 0$) so that (7) is efficiently
started at n = m + 1. It is assumed in this algorithm that $\gamma(n)$ is known for
all n. For procedures to estimate $\gamma(n)$ and other details, see [3].

²$\delta$ is set to -1 to rotate an equation out of the estimate.
Abstract: A class of algorithms is presented for training multilayer perceptrons using purely "linear" techniques.
The methods are based upon linearizations of the network using error surface analysis, followed by a
contemporary least squares estimation procedure. Specific algorithms are presented to estimate weights node-wise,
layer-wise, and for estimating the entire set of network weights simultaneously. In several experimental
studies, the node-wise method is superior to back-propagation and an alternative linearization method due to
Azimi-Sadjadi et al. in terms of number of convergences and convergence rate. The layer and network-wise
updating offer further improvement.

1.  Introduction 

This paper introduces a new class of learning algorithms for feedforward neural networks (FNNs) with improved
convergence properties. In spite of the nonlinearities present in the dynamics of a FNN, the learning
algorithm is purely "linear" in the sense that it is based on a contemporary version (see [1]) of the recursive
least squares (RLS) algorithm (e.g. [2]). Accordingly, unlike the popular back-propagation algorithm used to
train FNNs [3, 4], the new learning algorithm and its potential variants will benefit from the well-understood
theoretical properties of RLS and VLSI architectures for its implementation.

A FNN is an artificial neural network consisting of nodes grouped into layers. In this paper, we consider
a two-layer network², but the generalization of the method to an arbitrary number of layers is not difficult.
Working from the bottom up, we shall frequently refer to layers zero, one, and two as the "input," "hidden,"
and "output" layers, respectively. Each node above the input layer in the FNN passes the sum of its weighted
inputs through a nonlinearity to produce its output. The inputs to the input layer are the external inputs to
the network, and the outputs of the output layer are the external outputs.

The number of nodes in layer i is denoted $N_i$, with $N_0$ indicating the number of input nodes at the bottom
of the network. The weight connecting node j in the hidden layer to node k in the output layer is denoted $w_{k,j}$.
The weight connecting input node l to node j in the hidden layer is denoted $w'_{j,l}$. We denote by N the number
of training patterns of the form

$$\{(x_1(n), x_2(n), \ldots, x_{N_0}(n);\ t_1(n), t_2(n), \ldots, t_{N_2}(n)),\ n = 1, 2, \ldots, N\}, \tag{1}$$

in which $x_l(n)$ is the input to the $l$th node in layer zero, and $t_k(n)$ is the target output for node k in the output
layer (output desired in response to the corresponding input). The computed outputs of layer two [one] in
response to $x_1(n), \ldots, x_{N_0}(n)$ are denoted $y_1(n), \ldots, y_{N_2}(n)$ [$y'_1(n), \ldots, y'_{N_1}(n)$]. Finally, we need to formalize the
nonlinearity associated with the nodes. Consider node k in the output layer. For given weights, $w_{k,j}$, $j \in [1, N_1]$,
the output in response to the $n$th input is

$$y_k(n) = S\Big(\sum_{j=1}^{N_1} w_{k,j}\,y'_j(n)\Big), \tag{2}$$

in which $S(\cdot)$ is a differentiable nonlinear mapping. For future purposes, we define $\dot{S}(\cdot)$ to be the derivative of
$S(\cdot)$. For convenience, we also define $u_k(n) := \sum_{j=1}^{N_1} w_{k,j}\,y'_j(n)$. Clearly, $u_k(n)$ is the input to node k in the
output layer in response to pattern n. $u'_j(n)$ is similarly defined as the input to node j in the hidden layer.
the National Science Foundation under Grant No. MIP-9016734. Mr. Hunt was also supported by a fellowship from the University

²Some authors might choose to call this a three-layer network. We shall designate the bottom layer of "nodes" as "layer zero"
and not count it in the total number of layers. Layer zero is a set of linear nodes which simply pass the inputs unaltered.
Many training (weight estimation) algorithms exist for this type of network (e.g. [3] - Azimi-Sadjadi 89). The
most popular, the back-propagation algorithm [3], [4], performs satisfactorily in some cases if given enough time
to converge. However, the literature abounds with example applications in which back-propagation convergence
is too slow for practical usage (e.g. see [8]). One attempt to develop faster training methods is represented by
the class of algorithms in which the network mapping is "linearized" in some sense in order to take advantage
of linear estimation algorithms. It is with this class of algorithms that this paper is concerned.

2.  Linearization  Algorithm 

The fundamental training problem for the two-layer FNN is stated as follows: Given a set of N training patterns
as in (1), find the network weights which minimize the sum of squared errors,
$E = \sum_{n=1}^{N} \sum_{k=1}^{N_2} \lambda(n)\,(t_k(n) - y_k(n))^2$,
where the error weights $\lambda(\cdot)$ are included for generality. For a given set of training pairs, E is a function of
the weights of the network. A graph of E over the weight space is frequently called an error surface. Ideally, a
training algorithm would find the weights corresponding to the global minimum of the error surface. Training
algorithms usually operate by sequentially presenting the training patterns and moving the weights toward a
minimum of the error surface. The procedure is repeated several times using different initial weights in order
to locate the best minimum. Ideally, all weights will be altered with each presentation of the set of training
patterns so that the weights may move in the direction of steepest descent. In this case the algorithm represents
a true gradient descent approach. In practice, however, no reasonable algorithm exists which can simultaneously
change each weight in the network. In fact, the popular back-propagation algorithm works on only one weight
at a time. One of the principal benefits of the method to be presented here is that many weights can be
simultaneously updated.

The linearization technique adopted in this work can be explained in terms of error surface analysis. In
effect, for a present set of weights and a given training pattern, we construct a "linearized" network with an
error surface, say $\tilde{E}$, which is "similar" in some sense to E in a neighborhood of the present weights. There are
two similarity criteria: first, that the magnitude of $\tilde{E}$ and E be the same at the present weights; and second,
that the derivatives of $\tilde{E}$ and E with respect to the weights to be updated be the same at the present weights
(since the other weights are not altered, it is not necessary that the derivatives with respect to those weights
match).

Let us digress momentarily from the simple two-layer network and use a more general description. Suppose
that the weights connected to one or more nodes in layer L are to be updated simultaneously³. This may include
as few as one, and as many as all, nodes in layer L. Denote the set of such selected nodes by $\mathcal{N}^{*}$. Denote by $\mathcal{N}^{\dagger}$
the set of all nodes above layer L to which any node in $\mathcal{N}^{*}$ is connected, directly or indirectly. Let all weights not
connected to nodes in $\mathcal{N}^{*}$ and $\mathcal{N}^{\dagger}$ be fixed at present values⁴. Then it is shown in [9] that a "linearized" network
whose error surface $\tilde{E}$ is similar to E in the senses above is constructed by replacing the nonlinearity $S(\cdot)$ for
each node in $\mathcal{N}^{*}$ and $\mathcal{N}^{\dagger}$ by a linear approximation, say $\tilde{S}(\cdot)$, consisting of the first two terms of a Taylor series
around the "present" value of the node's input. For example, suppose the $k$th output node is to be linearized
with respect to the $n$th training pattern. Let $\hat{w}_{k,j}$ denote the present value of weight $w_{k,j}$. Then,

$$\begin{aligned}
S(u) \approx \tilde{S}(u) &= \dot{S}\Big(\sum_{j=1}^{N_1} \hat{w}_{k,j}\,y'_j(n)\Big)\Big[u - \sum_{j=1}^{N_1} \hat{w}_{k,j}\,y'_j(n)\Big] + S\Big(\sum_{j=1}^{N_1} \hat{w}_{k,j}\,y'_j(n)\Big) \\
&= \dot{S}\Big(\sum_{j=1}^{N_1} \hat{w}_{k,j}\,y'_j(n)\Big)\,u + \Big[S\Big(\sum_{j=1}^{N_1} \hat{w}_{k,j}\,y'_j(n)\Big) - \dot{S}\Big(\sum_{j=1}^{N_1} \hat{w}_{k,j}\,y'_j(n)\Big)\sum_{j=1}^{N_1} \hat{w}_{k,j}\,y'_j(n)\Big] \\
&=: K_k(n)\,u + b_k(n).
\end{aligned} \tag{3}$$


In fact, since $\tilde{S}(u) = S(u)$ if u is the input corresponding to the present weights, any node not in $\mathcal{N}^{*}$ or $\mathcal{N}^{\dagger}$ may
also be linearized with no effect on the solution. Therefore, we may assume without loss of generality that the
entire network is linearized, even if only a portion of the weights is to be updated.
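The node linearization can be checked numerically with a few lines of Python. The sketch assumes a logistic sigmoid for the nonlinearity, which the text does not specify beyond differentiability, and the names `S`, `S_dot`, and `linearize` are introduced only for this illustration. The slope and offset are those of a first-order Taylor expansion about the present input.

```python
import math


def S(u):
    """Logistic sigmoid, used here as an assumed example of the nonlinearity."""
    return 1.0 / (1.0 + math.exp(-u))


def S_dot(u):
    """Derivative of the logistic sigmoid."""
    s = S(u)
    return s * (1.0 - s)


def linearize(u_hat):
    """Return slope K and offset b of the tangent line to S at u_hat."""
    K = S_dot(u_hat)
    b = S(u_hat) - K * u_hat
    return K, b


K, b = linearize(0.3)
exact_at_present_input = S(0.3)
linear_at_present_input = K * 0.3 + b
```

At the present input the two values coincide, which is the first similarity criterion; matching slopes give the second.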

It will become clear below that once the network is linearized by replacing the operation $S(\cdot)$ by $\tilde{S}(\cdot)$ in all
appropriate nodes, in principle any least square error algorithm can be used to update the weights. Algorithms
based on similar ideas for updating weights one node at a time are given by Azimi-Sadjadi et al. [o] (henceforth,
the A-S algorithm) and by Hunt and Deller [9]. The former is based on the conventional RLS algorithm [2] with a

³If any weight connected to a node is to be updated, then every weight connected to that node must be updated. This constraint
is ordinarily beneficial, since it implies the ability to simultaneously update more than one weight.

⁴In some cases it is possible to update weights in different layers simultaneously. We discuss one case at the end of this section.




forgetting factor, while the latter employs a contemporary QR decomposition algorithm [1, 10] for significant
performance improvement. The view of the method taken above allows us to further exploit the linearization
by complete layer-wise updating of weights for even further improvement. Let us pursue this layer-wise approach.

Suppose we wish to update all weights in the output layer simultaneously. We must linearize all output
nodes (and may arbitrarily linearize any other nodes). For node k in the output layer, the output in response
to input n is computed as in (2). Let $\tilde{y}_k(n)$ represent the output of node k after $S(u_k(n))$ has been replaced by
$\tilde{S}(u_k(n)) = K_k(n)\,u_k(n) + b_k(n)$. Accordingly,

$$\tilde{y}_k(n) = K_k(n)\Big[\sum_{j=1}^{N_1} w_{k,j}\,y'_j(n)\Big] + b_k(n) \quad \text{or} \quad z_k(n) = K_k(n)\Big[\sum_{j=1}^{N_1} w_{k,j}\,y'_j(n)\Big] \tag{4}$$

with $z_k(n) := \tilde{y}_k(n) - b_k(n)$. We speak of the rightmost form in (4) as descriptive of a linearized node since
the output is a purely linear combination of the inputs to the node. The network with all appropriate nodes
linearized will be called the linearized network. Since $\tilde{y}_k(n) = y_k(n)$ at the present weights, the error at the $k$th
node will be the same for the linearized and original network if the target value for $z_k(n)$, say $\tilde{t}_k(n)$, is taken to
be

$$\tilde{t}_k(n) := t_k(n) - b_k(n) \tag{5}$$

and the linearized inputs to node k at pattern n are

$$\tilde{x}_{k,j}(n) := K_k(n)\,y'_j(n), \quad j = 1, 2, \ldots, N_1. \tag{6}$$

Note that the linearized inputs are dependent upon k, so that we have effectively increased the number of
training pairs by a factor of $N_2$.

The problem has effectively been reduced to one of estimating weights for a single-layer linear network. In
order to simultaneously update all the weights in the output layer, the system of $N \times N_2$ equations

$$\tilde{t}_k(n) = \sum_{j=1}^{N_1} \tilde{x}_{k,j}(n)\,w_{k,j}, \quad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N \tag{7}$$

must be solved for the least square estimate of the $N_1 \times N_2$ weights $w_{k,j}$, $k \in [1, N_2]$, $j \in [1, N_1]$. However, since
all weights in the hidden layer are fixed, the outputs $y'_j(n)$ are independent of k. This means that the equations
indexed by different values of k are independent of one another, and the sets of weights connected to different
outputs may be updated independently. In the output layer, therefore, there is no theoretical difference between
layer-wise and node-wise updating. This is not true at lower layers, however, as we now show for the hidden
layer of the present network.
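The construction of one linearized training pair for an output node can be sketched in Python as follows. The sketch assumes a logistic sigmoid nonlinearity and uses hypothetical function names; it forms the linearized inputs as the hidden outputs scaled by the slope and the linearized target as the original target minus the offset, and then confirms that the residual of the linear equation at the present weights equals the original output error.

```python
import math


def S(u):
    """Logistic sigmoid, an assumed example of the node nonlinearity."""
    return 1.0 / (1.0 + math.exp(-u))


def linearize(u_hat):
    """Slope K and offset b of the tangent line to S at u_hat."""
    K = S(u_hat) * (1.0 - S(u_hat))
    return K, S(u_hat) - K * u_hat


def linearized_output_pair(w_k, y_hidden, t_k):
    """Return (x_lin, t_lin) for output node k and one training pattern."""
    u_hat = sum(w * y for w, y in zip(w_k, y_hidden))
    K, b = linearize(u_hat)
    x_lin = [K * y for y in y_hidden]  # linearized inputs
    t_lin = t_k - b                    # linearized target
    return x_lin, t_lin


w_k = [0.4, -0.7, 0.2]        # present weights into node k (illustrative)
y_hidden = [0.9, 0.1, 1.0]    # hidden-layer outputs, last entry acting as a bias
t_k = 1.0
x_lin, t_lin = linearized_output_pair(w_k, y_hidden, t_k)
linear_residual = t_lin - sum(x * w for x, w in zip(x_lin, w_k))
original_error = t_k - S(sum(w * y for w, y in zip(w_k, y_hidden)))
```

Stacking such pairs over all patterns gives, for each k, an ordinary linear least squares problem in the weights feeding node k.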

To update all weights in the hidden layer simultaneously, the weights in the output layer are fixed and all
nodes in the network must be linearized. The outputs of the hidden layer with $S(\cdot)$ replaced by $\tilde{S}(\cdot)$ are given
by

$$\tilde{y}'_j(n) = K'_j(n)\Big[\sum_{l=1}^{N_0} w'_{j,l}\,x_l(n)\Big] + b'_j(n), \quad j = 1, 2, \ldots, N_1. \tag{8}$$

Substituting (8) in the leftmost expression in (4) results in

$$\tilde{y}_k(n) - \Big[K_k(n)\sum_{j=1}^{N_1} w_{k,j}\,b'_j(n) + b_k(n)\Big] = \sum_{j=1}^{N_1}\sum_{l=1}^{N_0} \big[K_k(n)\,w_{k,j}\,K'_j(n)\,x_l(n)\big]\,w'_{j,l}. \tag{9}$$

As above, we can now view the problem as one of training a single-layer linear mapping with target outputs

$$\tilde{t}'_k(n) := t_k(n) - \Big[K_k(n)\sum_{j=1}^{N_1} w_{k,j}\,b'_j(n) + b_k(n)\Big] \tag{10}$$

and inputs

$$\tilde{x}'_{k,j,l}(n) := K_k(n)\,w_{k,j}\,K'_j(n)\,x_l(n). \tag{11}$$

The weight estimates for $w'_{j,l}$, $j \in [1, N_1]$, $l \in [1, N_0]$, comprise the least square error solution to the system of
equations

$$\tilde{t}'_k(n) = \sum_{j=1}^{N_1}\sum_{l=1}^{N_0} \tilde{x}'_{k,j,l}(n)\,w'_{j,l}, \quad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N. \tag{12}$$

Unlike  the  output  layer,  we  see  that  the  problem  cannot  be  decomposed  into  separate  solutions  for  sets  of 
weights  connected  to  individual  nodes  in  the  hidden  layer.  This  is  a  reflection  of  the  fact  that  all  weights  in  the 
hidden  layer  are  coupled  through  their  "mixing"  in  the  output  layer.  This  means  that  the  simultaneous  solution 
for  all  weights  in  the  hidden  layer  should  be  beneficial  with  respect  to  a  node-wise  solution.  Indeed  we  will  find 
this  to  be  the  case  in  the  experiments.  Of  course,  this  same  intra-layer  dependence  of  weights  would  continue 
if  there  were  further  hidden  layers  to  be  considered. 
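The coupling of the hidden-layer weights through the output layer can be made concrete with a small sketch. It assumes a logistic sigmoid nonlinearity, takes each linearized input to be the product of the output-node slope, the output weight, the hidden-node slope, and the external input, and subtracts the accumulated offsets from the target. All names are hypothetical. The closing lines confirm that, at the present weights, the residual of the linear system equals the original output error.

```python
import math


def S(u):
    """Logistic sigmoid, an assumed example of the node nonlinearity."""
    return 1.0 / (1.0 + math.exp(-u))


def linearize(u_hat):
    """Slope K and offset b of the tangent line to S at u_hat."""
    K = S(u_hat) * (1.0 - S(u_hat))
    return K, S(u_hat) - K * u_hat


def hidden_layer_system(x, w_hid, w_out, t):
    """Linearized inputs x_lin[k][j][l] and targets t_lin[k] for one pattern.

    x: external inputs; w_hid[j][l]: hidden-layer weights;
    w_out[k][j]: output-layer weights (held fixed); t[k]: targets.
    """
    u_hid = [sum(w * xl for w, xl in zip(row, x)) for row in w_hid]
    lin_hid = [linearize(u) for u in u_hid]  # (K'_j, b'_j) per hidden node
    y_hid = [S(u) for u in u_hid]
    x_lin, t_lin = [], []
    for k, row in enumerate(w_out):
        K_k, b_k = linearize(sum(w * y for w, y in zip(row, y_hid)))
        x_lin.append([[K_k * w_kj * K_j * xl for xl in x]
                      for w_kj, (K_j, _) in zip(row, lin_hid)])
        offset = K_k * sum(w_kj * b_j for w_kj, (_, b_j) in zip(row, lin_hid)) + b_k
        t_lin.append(t[k] - offset)
    return x_lin, t_lin


x = [1.0, 0.0, 1.0]                           # two inputs plus a unity bias input
w_hid = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.3]]  # illustrative present weights
w_out = [[0.7, -0.6]]
t = [1.0]
x_lin, t_lin = hidden_layer_system(x, w_hid, w_out, t)
linear_residual = t_lin[0] - sum(x_lin[0][j][l] * w_hid[j][l]
                                 for j in range(2) for l in range(3))
y_hid = [S(sum(w * xl for w, xl in zip(row, x))) for row in w_hid]
original_error = t[0] - S(sum(w * y for w, y in zip(w_out[0], y_hid)))
```

Every equation for output k involves all hidden weights at once, which is the coupling that prevents a node-by-node decomposition in the hidden layer.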

Note that, for a fixed k, the inputs to the linearized network, $\tilde{x}'_{k,j,l}(n)$, $n \in [1, N]$, are most conveniently viewed
as two-dimensional (indexed by couples (j, l)). There are N such "grid" inputs for each k, paired with the N
values of $\tilde{t}'_k(n)$. If there were further hidden layers in the network, we would find that the effective inputs would
continue to increase in dimension. Further, it is noted that the role of k in (12) is somewhat superfluous. In
principle, the index is used to keep track of which of $N_2$ outputs in the linearized network is being considered.
However, the training pairs $(\tilde{t}'_k(n);\ \tilde{x}'_{k,1,1}(n), \ldots, \tilde{x}'_{k,N_1,N_0}(n))$, $k \in [1, N_2]$, $n \in [1, N]$, can be reindexed by
mapping pairs $(k, n) \to i$ so that the training pairs may be written $(\tilde{t}'(i);\ \tilde{x}'_{1,1}(i), \ldots, \tilde{x}'_{N_1,N_0}(i))$, $i \in [1, N \times N_2]$.
Of course, an identical system of equations to (12) results, but the linearized network may be viewed as a single
output linear layer with $N \times N_2$ training pairs.

Updating of some subset of the weights in the hidden layer (in particular, "node-wise" as in the A-S algorithm)
is tantamount to solving the subsystem of (12) corresponding to the desired weights, introducing the
updated values into the system, solving for the next desired subset, etc. Clearly, this will result in a different
solution than the simultaneous solution. In terms of the error surfaces, this process consists of continually updating
the error surface as "partial" information becomes available, then moving in the direction of the gradient
with respect to a new subset of weights in the updated surfaces. Intuitively, movement "at once" with respect
to the "complete" gradient would seem to be a preferable procedure. Indeed, the latter operation corresponds
to the simultaneous updating.

The linearization allows us to approximate the error surface of the nonlinear system for only a small neighborhood
around the present weights. Because of the criteria used to construct $\tilde{E}$, the weights will be changed
in the direction of the true gradient in the nonlinear space, but will move to the minimum of $\tilde{E}$, which may be
quite far from the neighborhood over which $\tilde{E} \approx E$. Accordingly, the weights must be allowed to change only a
small amount using the training patterns of the linearized system. If the linearized procedure results in a large
change of weights, measures must be taken to decrease the alteration. The updating procedure is repeated until
changing the weights does not result in a decrease in error. The algorithm proceeds as follows: linearize the
system around the present weights, change the weights by a small amount to decrease error, then repeat the
procedure. This is done until changing the weights does not decrease the error or a maximum on the number
of linearizations is reached.
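One simple way to keep each update inside the region where the linearization is trustworthy is to cap the Euclidean norm of the weight change. The Python sketch below does this by shrinking the proposed step along its own direction. The scaling rule and the name `constrain_update` are assumptions made for illustration; the text only requires that large changes be reduced.

```python
import math


def constrain_update(w_old, w_new, max_change):
    """Limit the Euclidean norm of (w_new - w_old) to max_change."""
    delta = [b - a for a, b in zip(w_old, w_new)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= max_change:
        return list(w_new)  # proposed step is already small enough
    scale = max_change / norm
    return [a + scale * d for a, d in zip(w_old, delta)]


# A proposed step of length 5 is shortened to length 0.2 in the same direction.
limited = constrain_update([0.0, 0.0], [3.0, 4.0], 0.2)
```

Scaling the whole step preserves its direction, so the constrained update still moves toward the minimum of the linearized error surface.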

For the same reason that simultaneous layer-wise estimation of weights is beneficial, we should expect even
more benefit from complete network updating if such were possible. It follows from the developments above
that entire network updating is possible for at least one case. If there is a single node in the output layer of the
network, let k = 1 and define

$$\omega_{j,l} := w_{1,j}\,w'_{j,l}. \tag{13}$$

From (9) it follows that

$$\tilde{y}_1(n) - b_1(n) = \sum_{j=1}^{N_1}\sum_{l=1}^{N_0} \big[K_1(n)\,K'_j(n)\,x_l(n)\big]\,\omega_{j,l} + \sum_{j=1}^{N_1} \big[K_1(n)\,b'_j(n)\big]\,w_{1,j}. \tag{14}$$

This can be interpreted as an attempt to train a single linear layer with one output and $(N_0 \times N_1) + N_1$
inputs. In this case, there will be only N linearized training patterns. The system can be solved for $\omega_{j,l}$ and
$w_{1,j}$, $j \in [1, N_1]$, $l \in [1, N_0]$, and (13) can be used to solve for $w'_{j,l}$, $j \in [1, N_1]$, $l \in [1, N_0]$.
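Once the linear system has been solved for the product terms and the output weights, the hidden-layer weights follow by division. The Python sketch below illustrates this last step, assuming the product terms are the output weight times the hidden weight; the names `omega`, `w_out`, and `recover_hidden_weights` are hypothetical. An output weight of zero leaves the corresponding hidden weights undetermined, so the sketch raises an error in that case.

```python
def recover_hidden_weights(omega, w_out):
    """Return w_hid[j][l] = omega[j][l] / w_out[j] for a single-output network."""
    w_hid = []
    for j, (row, w_1j) in enumerate(zip(omega, w_out)):
        if w_1j == 0.0:
            raise ZeroDivisionError(f"output weight for hidden node {j} is zero")
        w_hid.append([value / w_1j for value in row])
    return w_hid


omega = [[0.6, -0.3], [0.2, 0.4]]  # illustrative solved product terms
w_out = [2.0, -0.5]                # illustrative solved output weights
w_hid = recover_hidden_weights(omega, w_out)
```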

3.  Experimental  Results 

The results given in this section compare five training strategies for a FNN. These are: 1. Conventional
back-propagation (no linearization in the sense described here, weight-wise updating); 2. A-S algorithm (
Implementation →  Back-Prop | A-S | Node Updating | Layer Updating | Network Updating

No. of Convergences →

S  j  78 

96  99 

Table 1: Number of convergences per 100 sets of initial weights in experiments with the XOR network.

Figure 1. Average error in dB for the XOR implementations vs. iteration number. 1. Back-propagation; 2. A-S
algorithm; 3. Node-wise updating; 4. Layer-wise updating; 5. Network-wise updating.

linearization, then conventional RLS with a forgetting factor for node-wise updating); 3. Linearization method
described above with node-wise updating based on QR decomposition; 4. Same as 3 with layer-wise updating; 5.
Same as 3 with complete network updating. The two-bit parity checker (XOR) network used in the simulations
has two inputs, two hidden layer nodes and one output node. An additional node whose output value was always
unity is added at each layer to serve as a bias for each node in the layer above. The initial weights were
chosen as follows. Each weight in the network was selected randomly from a uniform distribution over the
interval [-1, 1]. This procedure was repeated 100 times to select 100 sets of initial weights. The same 100 sets
of weights were used for all five implementations. For the back-propagation algorithm, a factor of 0.04 was
used in the weight updating equation. The A-S algorithm was implemented using no weight change constraints.
The forgetting factor for A-S and for the QR decomposition implementation was 0.98. The QR decomposition
implementation used a weight constraint of 0.2, meaning that the weight vector associated with each node was
allowed to change at most by 0.2 in Euclidean norm during each iteration. The layer-wise updating algorithm
had a forgetting factor of 0.3 and a weight constraint of 10. The network-wise updating algorithm had the same
forgetting factor and weight constraint as the layer case.

Simulations were run to compare the number of times each implementation found weights that solve the
XOR problem for the 100 initial weight sets. The results are shown in Table 1.

Simulations were also run to compare the output error of each algorithm. In the resulting figures, the error
in dB means the following: Let $\varepsilon(i)$ be the sum of the squared errors incurred in iteration i through the training
patterns, averaged over the 100 initial weight sets. Then, plotted in the figures is $10 \log(\varepsilon(i)/\mu)$ (dB), where $\mu$
is the maximum possible error in any iteration. Figure 1 shows the errors of the four XOR implementations.
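The plotted quantity is a normalized error expressed in decibels. A minimal Python sketch of the conversion follows, assuming a base-10 logarithm; the function name `error_db` and the sample values are illustrative only.

```python
import math


def error_db(eps, mu):
    """Return 10*log10(eps / mu), the error relative to its maximum, in dB."""
    if eps <= 0.0 or mu <= 0.0:
        raise ValueError("both the error and its maximum must be positive")
    return 10.0 * math.log10(eps / mu)


# An averaged squared error equal to one tenth of the maximum possible error.
example = error_db(0.4, 4.0)
```

Normalizing by the maximum possible error places every curve at or below 0 dB, so the curves for the different implementations share a common scale.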


4.  Conclusions 


A new implementation of the node-wise weight updating algorithm for feedforward neural networks and new
algorithms that update weights layer-wise and network-wise have been presented in this paper. The QR decomposition
implementation has been shown experimentally to be superior to standard recursive equations for the
node-wise updating algorithm. The layer-wise and network-wise weight updating algorithms were developed
to improve the convergence rate and the speed of convergence. Both objectives were accomplished, with the
layer-wise weight updating algorithm showing a significant advantage over both the single node weight updating
algorithm used as a reference, and the widely used back-propagation algorithm.